Authors: JangHyeon Lee, Philipe Ambrozio Dias, Yao-Yi Chiang, Dalton Lunga
Learning general-purpose representations of geographic locations has become essential to geospatial tasks such as population estimation and environmental monitoring. To obtain such representations, multimodal geo-foundation models often use contrastive learning (CL) to align satellite imagery with geo-coordinates, implicitly assuming that cross-modal (shared) information suffices for downstream tasks. However, given the breadth of tasks, task-relevant information may lie beyond the shared space, so retaining modality-specific (unique) features can improve task performance. Prior methods retain unique information through extra training objectives or external models, increasing training complexity. Motivated by the conventional wisdom that earlier layers capture general input features while later layers become task-specific, we hypothesize that intermediate layers in CL models retain more modality-specific structure than the alignment-optimized final layer. Through a trifecta layerwise analysis of modality gap, representation similarity, and mutual information, we validate this trend and find that fusing intermediate (more unique) and final (more shared) representations yields consistent gains on diverse geospatial tasks. Our findings reveal underutilized information diversity in CL models and show that simple layerwise fusion is an efficient path to richer geo-embeddings.
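The abstract names two ideas that are easy to illustrate: the modality gap between image and location embeddings, and the fusion of an intermediate-layer representation with the final-layer one. The sketch below is a minimal illustration, not the authors' implementation. It assumes the modality gap is measured as the distance between the centroids of L2-normalized embeddings (a common definition), and that fusion is plain concatenation; the abstract does not specify either choice. The function names `modality_gap` and `fuse_layers` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def modality_gap(img_emb, loc_emb):
    """Distance between the centroids of two sets of normalized embeddings.

    Assumption: centroid distance is one common definition of the gap;
    the paper may measure it differently. Both inputs have shape (n, d).
    """
    img_center = l2_normalize(img_emb).mean(axis=0)
    loc_center = l2_normalize(loc_emb).mean(axis=0)
    return float(np.linalg.norm(img_center - loc_center))


def fuse_layers(intermediate, final):
    """Fuse two layers' embeddings by concatenating them per sample.

    Assumption: concatenation stands in for whatever fusion the paper uses.
    Inputs have shapes (n, d1) and (n, d2); the output has shape (n, d1 + d2).
    """
    return np.concatenate(
        [l2_normalize(intermediate), l2_normalize(final)], axis=-1
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mid_layer = rng.normal(size=(8, 16))    # stand-in for intermediate features
    last_layer = rng.normal(size=(8, 32))   # stand-in for final-layer features
    coords = rng.normal(size=(8, 32))       # stand-in for location embeddings

    print("fused shape:", fuse_layers(mid_layer, last_layer).shape)
    print("modality gap:", modality_gap(last_layer, coords))
```

Normalizing each layer before concatenation keeps one layer from dominating the fused vector purely because of its scale, which matters when the result feeds a linear probe or a nearest-neighbor lookup.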
Subject: CVPR.2026 - Poster
#1 Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models