Recent advancements in large language and vision-language models have made it possible to solve new tasks via zero-shot inference, without task-specific training. Various adaptation techniques, such as In-Context Learning (ICL), supervised fine-tuning, and prompt engineering, can further enhance a model's performance on a given task. However, these methods require either labeled examples or substantial manual effort to construct effective prompts. In this work, we introduce a joint inference framework that extends standard zero-shot inference. In contrast to independent zero-shot predictions, joint inference makes predictions simultaneously for all inputs of a given task. Since direct joint inference involves a computationally expensive optimization, we develop efficient approximation techniques, resulting in two unsupervised adaptation methods compatible with language and vision-language models: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of both approaches across a broad range of tasks and models, including the language-only Llama 3.1, the vision-language Open-Flamingo, and the API-only GPT-4o models. Our experiments reveal substantial improvements over the standard zero-shot approach. Furthermore, although unsupervised, our approach often performs on par with supervised approaches that rely on ground-truth labels.
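Schematically, the contrast between the two inference modes can be written as follows; the notation here ($x_i$ for inputs, $y_i$ for predictions, $p_\theta$ for the model) is illustrative and is not necessarily the paper's exact formulation. Standard zero-shot inference labels each input independently, whereas joint inference optimizes over all labels of the task at once:
\[
  \hat{y}_i = \operatorname*{arg\,max}_{y} \; p_\theta(y \mid x_i), \quad i = 1, \dots, N \qquad \text{(independent zero-shot)}
\]
\[
  (\hat{y}_1, \dots, \hat{y}_N) = \operatorname*{arg\,max}_{y_1, \dots, y_N} \; p_\theta(y_1, \dots, y_N \mid x_1, \dots, x_N) \qquad \text{(joint inference)}
\]
The second objective couples the predictions across all unlabeled inputs, which is what makes exact optimization expensive and motivates the approximations (unsupervised fine-tuning and unsupervised ICL) described above.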