Recent advancements in large language and vision-language models have made it possible to solve new tasks via zero-shot inference, without task-specific training. Various adaptation techniques, such as In-Context Learning (ICL), supervised fine-tuning, and prompt engineering, can further enhance a model's performance on a given task. However, these methods require either labeled examples or substantial manual effort to construct effective prompts. In this work, we introduce a joint inference framework that extends standard zero-shot inference. In contrast to independent zero-shot predictions, joint inference makes predictions simultaneously for all inputs of a given task. Since direct joint inference involves a computationally expensive optimization, we develop efficient approximation techniques, resulting in two unsupervised adaptation methods compatible with language and vision-language models: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of both approaches across a broad range of tasks and models, including the language-only Llama 3.1, the vision-language Open-Flamingo, and the API-only GPT-4o models. Our experiments reveal substantial improvements over the standard zero-shot approach. Furthermore, although unsupervised, our approach often performs on par with supervised approaches that rely on ground-truth labels.
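Schematically, the contrast between the two inference modes can be written as follows; the notation here ($x_i$ for inputs, $y_i$ for predictions, $p_\theta$ for the model) is illustrative and is not necessarily the paper's exact formulation. Standard zero-shot inference labels each input independently, whereas joint inference optimizes over all labels of the task at once:
\[
  \hat{y}_i = \operatorname*{arg\,max}_{y} \; p_\theta(y \mid x_i), \quad i = 1, \dots, N \qquad \text{(independent zero-shot)}
\]
\[
  (\hat{y}_1, \dots, \hat{y}_N) = \operatorname*{arg\,max}_{y_1, \dots, y_N} \; p_\theta(y_1, \dots, y_N \mid x_1, \dots, x_N) \qquad \text{(joint inference)}
\]
The second objective couples the predictions across all unlabeled inputs, which is what makes exact optimization expensive and motivates the approximations (unsupervised fine-tuning and unsupervised ICL) described above.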