Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification


Authors: Junjie Zhang, Feng Zhao, Hanqiang Liu, Jun Yu

Rapidly advancing remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires a model to overcome data heterogeneity while retaining strong cross-scene generalization ability. Moreover, most vision-language models (VLMs) describe surface materials in RS images using universal texts and lack proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching the input information for FVMGN. Next, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low- and high-frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is proposed to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in the spatial and frequency domains. Extensive experiments show that FVMGN achieves excellent multimodality generalization ability compared with state-of-the-art (SOTA) methods.
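The abstract does not say which wavelet MWDis uses or how its resampling works, so the following is only a minimal sketch of the general idea of splitting an image into low- and high-frequency components. It assumes a single-level 2D Haar transform on a plain nested list; the function names `haar_dwt2` and `haar_idwt2` are hypothetical and are not taken from the paper.

```python
# Illustrative single-level 2D Haar wavelet split.
# Assumption: this is NOT the authors' MWDis module, only a generic example
# of separating low-frequency and high-frequency image components.

def haar_dwt2(image):
    """Split a 2D list (even height and width) into LL, LH, HL, HH sub-bands."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # low-frequency approximation
            lh_row.append((a - b + d - e) / 2.0)  # left column minus right column
            hl_row.append((a + b - d - e) / 2.0)  # top row minus bottom row
            hh_row.append((a - b - d + e) / 2.0)  # diagonal detail
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2 and rebuild the original 2D list."""
    rows, cols = len(ll), len(ll[0])
    out = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            s, h, v, d = ll[r][c], lh[r][c], hl[r][c], hh[r][c]
            out[2 * r][2 * c] = (s + h + v + d) / 2.0
            out[2 * r][2 * c + 1] = (s - h + v - d) / 2.0
            out[2 * r + 1][2 * c] = (s + h - v - d) / 2.0
            out[2 * r + 1][2 * c + 1] = (s - h - v + d) / 2.0
    return out


if __name__ == "__main__":
    patch = [[1, 2], [3, 4]]
    ll, lh, hl, hh = haar_dwt2(patch)
    print(ll, lh, hl, hh)
    print(haar_idwt2(ll, lh, hl, hh))
```

In a sketch like this, the LL band carries the smooth content of a patch while the three detail bands carry edges and texture. A module that treats the two groups differently would operate on these sub-bands before inverting the transform.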

Subject: Computer Vision and Pattern Recognition

Published: 2025-11-13 19:52:45 UTC

arXiv: 2511.10774
