TerraMind: Large-Scale Generative Multimodality for Earth Observation (ICCV 2025, CVF)

Total: 1

#1 TerraMind: Large-Scale Generative Multimodality for Earth Observation

Authors: Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations that combine token-level and pixel-level data across modalities. At the token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while at the pixel level it leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities from a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early-fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "thinking in modalities" (TiM), the capability of generating additional artificial data during fine-tuning and inference to improve the model output, and (iii) TerraMind achieves performance beyond the state of the art on community-standard EO benchmarks such as PANGAEA. All models and code have been open-sourced under a permissive license at https://huggingface.co/ibm-esa-geospatial and https://github.com/ibm/terramind.
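The abstract describes TiM as a generate-then-predict control flow: the model first produces an artificial intermediate modality, then conditions its final output on the input fused with that generated "thought". The sketch below illustrates only that control flow in PyTorch; every class and function name in it is a hypothetical placeholder, not TerraMind's released API (see https://github.com/ibm/terramind for the actual interface).

```python
# Minimal, self-contained sketch of the "thinking in modalities" (TiM) idea.
# All names here are illustrative stand-ins, NOT the TerraMind API.

import torch
import torch.nn as nn


class ToyAnyToAnyModel(nn.Module):
    """Toy any-to-any stand-in: maps a source tensor to a target modality."""

    def __init__(self, channels_in: int, channels_out: int):
        super().__init__()
        self.net = nn.Conv2d(channels_in, channels_out, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def tim_inference(s2: torch.Tensor) -> torch.Tensor:
    """Two-step TiM-style inference (hypothetical, for illustration).

    Step 1: generate an artificial intermediate modality (here, a fake
            land-use map) from the Sentinel-2 input.
    Step 2: predict the final target (e.g. a water mask) conditioned on the
            original input concatenated with the generated modality.
    """
    lulc_generator = ToyAnyToAnyModel(channels_in=12, channels_out=10)
    segmenter = ToyAnyToAnyModel(channels_in=12 + 10, channels_out=1)

    lulc = lulc_generator(s2)             # "thinking" step: generate a modality
    fused = torch.cat([s2, lulc], dim=1)  # early fusion of input + generated data
    return segmenter(fused)               # final prediction conditioned on both


if __name__ == "__main__":
    s2_patch = torch.randn(1, 12, 224, 224)  # fake 12-band Sentinel-2 patch
    print(tim_inference(s2_patch).shape)     # torch.Size([1, 1, 224, 224])
```

In the actual model, this generate-then-condition step would happen in the pretrained any-to-any generator rather than via the dense convolutions used here; the sketch only shows the data flow that the TiM claim in the abstract refers to.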

Subject: ICCV.2025 - Poster