Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation

Authors: Kuanghong Liu, Jin Wang, Kangjian He, Dan Xu, Xuejie Zhang

Conventional multi-source few-shot domain adaptation (MFDA) faces the challenge of further reducing the load on edge-side devices in low-resource scenarios. Building on CLIP's native language supervision and the plug-and-play nature of prompts, which allow CLIP to be transferred efficiently, this paper introduces an uploadable multi-source few-shot domain adaptation (UMFDA) schema. UMFDA is a form of decentralized edge collaborative learning in which the edge-side models must maintain a low computational load, and only a limited number of annotations are provided in the source domain data, with most of the data unannotated. The paper further proposes a vision-aware multimodal prompt tuning framework (VAMP) under this decentralized schema, in which a vision-aware prompt guides the domain-specific text prompt to maintain semantic discriminability and perceive domain information. Cross-modal semantic alignment and domain distribution alignment losses optimize each edge-side model, while text classifier consistency and semantic diversity losses promote collaborative learning among the edge-side models. Extensive experiments on the OfficeHome and DomainNet datasets demonstrate the effectiveness of VAMP in the UMFDA setting, where it outperforms previous prompt tuning methods.
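
The idea of a consistency loss between edge-side text classifiers can be illustrated with a minimal sketch. This is not the paper's formulation, which the abstract does not specify. It assumes each edge-side model yields one text feature per class, that classification is a CLIP-style cosine-similarity softmax, and that consistency is measured as the mean pairwise L1 gap between predictions. The function names and the temperature value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each feature vector to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(logits, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def classifier_probs(image_feats, text_feats, temperature=0.01):
    # CLIP-style head: one text feature per class acts as that class's weight vector.
    # temperature=0.01 is an assumed value, not taken from the paper.
    logits = l2_normalize(image_feats) @ l2_normalize(text_feats).T / temperature
    return softmax(logits)

def consistency_loss(image_feats, text_feats_per_domain):
    # Assumed form: mean pairwise L1 gap between the class probabilities that
    # each edge-side text classifier assigns to the same batch of images.
    probs = [classifier_probs(image_feats, t) for t in text_feats_per_domain]
    gaps = [np.abs(p - q).mean() for i, p in enumerate(probs) for q in probs[i + 1:]]
    return float(np.mean(gaps)) if gaps else 0.0
```

Under these assumptions the loss is zero when every edge-side classifier produces identical predictions and grows as they disagree, so minimizing it pushes the source-domain models toward agreement on shared inputs.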

Subject: Computer Vision and Pattern Recognition

Publish: 2025-03-08 07:17:06 UTC

2503.06106
