🤖 AI Summary
Deploying vision-language models (VLMs) in low-resource, budget-constrained settings is challenging due to the high computational cost of large teacher models and the poor zero-shot performance—and expensive fine-tuning—of small student models.
Method: We propose online In-Context Distillation (ICD), a fine-tuning-free, inference-time knowledge transfer framework. It integrates three key components: cross-modal demonstration selection, teacher test-time scaling, and uncertainty-conditioned dynamic construction of a demonstration pool, which together significantly reduce how often the teacher must be queried.
Contribution/Results: Using as little as 4% teacher-annotated data, our method boosts small-model accuracy by up to 33%, competing with the teacher's zero-shot performance and outperforming fine-tuning under the same constrained compute budget. It enables efficient, compute-budget-aware adaptation without parameter updates, making it particularly suitable for resource-limited deployment scenarios.
📝 Abstract
As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
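The online loop the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: `student_predict`, `teacher_predict`, the confidence model, and the threshold/pool-size values are all hypothetical stand-ins chosen to show the control flow (query the teacher only when the student is uncertain; majority-vote teacher samples to reduce noise; keep a bounded demonstration pool).

```python
def student_predict(example, demonstrations):
    """Hypothetical student stub: inference conditioned on in-context
    demonstrations. Returns (prediction, confidence); here confidence
    simply grows with the number of demonstrations available."""
    confidence = min(1.0, 0.4 + 0.1 * len(demonstrations))
    return f"student_answer({example})", confidence

def teacher_predict(example, n_samples=3):
    """Hypothetical teacher stub with test-time scaling: draw several
    samples and majority-vote to reduce annotation noise."""
    samples = [f"teacher_answer({example})" for _ in range(n_samples)]
    return max(set(samples), key=samples.count)

def run_icd(stream, uncertainty_threshold=0.7, pool_size=8):
    """Online in-context distillation loop: the teacher is queried only
    when student confidence falls below the threshold, and its answers
    populate a bounded demonstration pool for later inputs."""
    pool = []            # demonstration pool of (example, teacher label)
    teacher_queries = 0
    outputs = []
    for example in stream:
        pred, conf = student_predict(example, pool)
        if conf < uncertainty_threshold:
            label = teacher_predict(example)   # sparse teacher annotation
            teacher_queries += 1
            pool.append((example, label))
            pool = pool[-pool_size:]           # keep the pool bounded
            pred = label
        outputs.append(pred)
    return outputs, teacher_queries
```

In this toy run the student's confidence crosses the threshold after a few teacher-labeled demonstrations accumulate, so the teacher is consulted only for the first few inputs — mirroring how sparse annotations (a small fraction of the stream) can suffice.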