Better Language Models Exhibit Higher Visual Alignment

Authors: Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano

How well do text-only Large Language Models (LLMs) naturally align with the visual world? We provide the first direct analysis by utilizing frozen text representations in a discriminative vision-language model framework and measuring zero-shot generalization on unseen classes. We find that decoder-based LLMs exhibit high intrinsic visual alignment. In particular, more capable LLMs reliably demonstrate stronger generalization. Moreover, utilizing frozen LLMs leads to strong gains in cross-lingual settings, where our approach reaches 38.7% accuracy for Chinese, compared with 1.4% for CLIP. Our proposed method improves both robustness and generalization and also significantly reduces the need for paired data and compute, making vision-language models more accessible and adaptable.

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition

Publish: 2024-10-09 17:59:33 UTC

2410.07173
