Zhang_Learning_Visual_Proxy_for_Compositional_Zero-Shot_Learning@ICCV2025@CVF


#1 Learning Visual Proxy for Compositional Zero-Shot Learning

Authors: Shiyu Zhang, Cheng Yan, Yang Liu, Chenchen Jing, Lei Zhou, Wenjun Wang

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Existing methods typically align textual prototypes with visual features using Vision-Language Models (VLMs), but they face two key limitations: (1) modality gaps hinder the ability to distinguish semantically similar attribute-object pairs, and (2) textual prototypes alone lack the fine-grained visual cues needed for accurate recognition. To address these challenges, we propose Visual Proxy Learning, a method that reduces modality gaps and enhances compositional generalization by initializing visual proxies for attributes, objects, and their compositions from text representations and optimizing the visual space to better capture fine-grained visual cues. To further strengthen cross-modal understanding, we introduce Cross-Modal Joint Learning (CMJL), which enforces consistency between text-image embeddings and fine-grained visual representations. This dual strategy improves generalization to unseen compositions and enhances the discrimination of similar pairs. Extensive experiments demonstrate that our method achieves state-of-the-art performance in closed-world settings and competitive results in open-world scenarios across four CZSL benchmarks, validating its effectiveness in improving compositional generalization.
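The abstract gives no implementation details, but the core idea of visual proxies can be illustrated in a minimal NumPy sketch. Everything here is an assumption for illustration: the embedding dimension, the sum-then-normalize composition rule, and the cosine-based consistency term are generic stand-ins, not the paper's actual design.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project vectors onto the unit sphere, as VLM embeddings typically are."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical setup: 2 attributes x 2 objects, 8-dim embeddings.
dim = 8
attrs = ["wet", "dry"]
objs = ["dog", "car"]
rng = np.random.default_rng(0)

# Stand-ins for frozen VLM text embeddings of attribute/object prompts.
text_attr = l2_normalize(rng.normal(size=(len(attrs), dim)))
text_obj = l2_normalize(rng.normal(size=(len(objs), dim)))

# Visual proxies are *initialized from* the text representations and would
# then be optimized on image features (training loop omitted here).
proxy_attr = text_attr.copy()
proxy_obj = text_obj.copy()

def compose(a_idx, o_idx):
    """One simple composition rule: sum primitive proxies, renormalize."""
    return l2_normalize(proxy_attr[a_idx] + proxy_obj[o_idx])

def classify(image_feat):
    """Score an image feature against every attribute-object composition."""
    image_feat = l2_normalize(image_feat)
    scores = {(a, o): float(image_feat @ compose(i, j))
              for i, a in enumerate(attrs)
              for j, o in enumerate(objs)}
    return max(scores, key=scores.get), scores

def consistency_loss(proxy, text):
    """Toy cross-modal consistency term: mean cosine distance between each
    visual proxy and its corresponding text embedding."""
    return float(np.mean(1.0 - np.sum(l2_normalize(proxy) * text, axis=-1)))

# Synthetic image feature lying near the "wet dog" composition.
img = compose(0, 0) + 0.1 * rng.normal(size=dim)
pred, scores = classify(img)
```

Since the proxies start as copies of the text embeddings, `consistency_loss` is zero at initialization; during training it would pull the optimized proxies back toward their textual counterparts while the classification objective adapts them to fine-grained visual cues.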

Subject: ICCV.2025 - Poster