

#1 DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Authors: Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut, Piotr Bojanowski

Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named `dino.txt`, unlocks this ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but yields unsatisfactory results on dense tasks. We propose several key ingredients that improve performance on both global and dense tasks: concatenating the [CLS] token with the average of the patch tokens when training the alignment, and curating data using both the text and image modalities. With these, we successfully train a CLIP-like model at a fraction of the computational cost of CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
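
As a rough illustration of the alignment recipe described in the abstract, here is a minimal PyTorch sketch: the frozen backbone's [CLS] token is concatenated with the average of its patch tokens, projected into a shared embedding space, and trained against text embeddings with a symmetric CLIP-style contrastive loss. The class name `DinoTxtVisualHead`, the helper `clip_style_loss`, and all dimensions and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DinoTxtVisualHead(nn.Module):
    """Builds the visual embedding used for alignment: the frozen backbone's
    [CLS] token concatenated with the average of its patch tokens, then
    projected to the shared vision-language space. Dimensions and the
    projection layer are illustrative, not the paper's exact configuration."""

    def __init__(self, backbone_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        # 2 * backbone_dim because the [CLS] token and the patch average
        # are concatenated before projection.
        self.proj = nn.Linear(2 * backbone_dim, embed_dim)

    def forward(self, cls_token: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token:    (B, D)    global token from the frozen vision encoder
        # patch_tokens: (B, N, D) per-patch tokens from the same encoder
        patch_avg = patch_tokens.mean(dim=1)               # (B, D)
        fused = torch.cat([cls_token, patch_avg], dim=-1)  # (B, 2D)
        return F.normalize(self.proj(fused), dim=-1)       # unit-norm embedding


def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between matched image/text pairs, as in
    CLIP/LiT. With a frozen vision backbone, only the text tower and this
    head would receive gradients."""
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    B, N, D = 4, 196, 768
    head = DinoTxtVisualHead(backbone_dim=D, embed_dim=512)
    img_emb = head(torch.randn(B, D), torch.randn(B, N, D))
    txt_emb = F.normalize(torch.randn(B, 512), dim=-1)     # stand-in text embeddings
    print(clip_style_loss(img_emb, txt_emb).item())
```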

Subject: CVPR.2025 - Poster