Large (Vision) Language Models are Unsupervised In-Context Learners

📅 2025-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the reliance of large language models (LLMs) and vision-language models (VLMs) on manual prompt engineering or labeled examples for downstream task adaptation. The authors propose a fully unsupervised joint inference framework that requires neither labels, handcrafted prompts, nor in-context examples. The method combines unsupervised fine-tuning and unsupervised in-context learning, built on joint inference modeling, efficient approximate optimization, and a unified cross-modal adaptation architecture that enables zero-shot task transfer. On GSM8K, it achieves an absolute accuracy gain of 39 percentage points over the zero-shot baseline. Evaluations across diverse models, including Llama-3.1, Qwen2.5-Math, OpenFlamingo, and GPT-4o, and across multiple reasoning and multimodal tasks show consistent and significant improvements. The authors report this as the first empirical demonstration that unsupervised adaptation can match the performance of supervised methods, establishing a paradigm for label-free model customization.

📝 Abstract
Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Various adaptation techniques such as prompt engineering, In-Context Learning (ICL), and supervised fine-tuning can further enhance the model's performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes independent predictions, the joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including language-only Llama-3.1 on natural language processing tasks, reasoning-oriented Qwen2.5-Math on grade school math problems, vision-language OpenFlamingo on vision tasks, and the API-only access GPT-4o model on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground truth labels.
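The abstract's contrast between independent zero-shot prediction and joint inference can be formalized as follows. This is an illustrative formalization based only on the description above, not necessarily the paper's exact objective:

```latex
% Zero-shot inference: each input is predicted independently.
\hat{y}_i = \arg\max_{y} \; p(y \mid x_i)

% Joint inference: predictions for all task inputs are made simultaneously.
(\hat{y}_1, \ldots, \hat{y}_N) = \arg\max_{y_1, \ldots, y_N} \; p(y_1, \ldots, y_N \mid x_1, \ldots, x_N)
```

Optimizing over all label assignments jointly is what makes direct joint inference expensive, motivating the approximation techniques the abstract mentions.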
Problem

Research questions and friction points this paper is trying to address.

Reliance on manual prompt engineering and labeled examples for downstream adaptation
Lack of fully unsupervised adaptation methods that work across diverse models
Limited accuracy of independent zero-shot inference on downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint inference framework for unsupervised adaptation
Efficient approximation techniques for optimization
Unsupervised fine-tuning and ICL methods
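One way to picture unsupervised ICL is a two-pass scheme: the model first produces zero-shot pseudo-labels, which then serve as in-context demonstrations for a second pass. This is a minimal sketch under that assumption, with a hypothetical `model` callable standing in for any LLM; it is not the paper's actual algorithm.

```python
def zero_shot_predict(model, x):
    """Independent zero-shot prediction: the model sees only one input."""
    return model(f"Q: {x}\nA:")

def unsupervised_icl_predict(model, inputs):
    """Sketch of unsupervised ICL (assumed mechanics): the model's own
    zero-shot outputs serve as demonstrations for a second, joint pass."""
    # Pass 1: pseudo-labels from independent zero-shot predictions.
    pseudo_labels = [zero_shot_predict(model, x) for x in inputs]
    refined = []
    for i, x in enumerate(inputs):
        # Pass 2: condition each prediction on the other examples'
        # pseudo-labeled (input, output) pairs -- no ground truth used.
        demos = "\n".join(
            f"Q: {xj}\nA: {yj}"
            for j, (xj, yj) in enumerate(zip(inputs, pseudo_labels))
            if j != i
        )
        refined.append(model(f"{demos}\nQ: {x}\nA:"))
    return refined
```

The key property the sketch illustrates is that adaptation needs no labels: every demonstration in the second pass comes from the model itself.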