🤖 AI Summary
Deploying vision-language models (VLMs) in low-resource, budget-constrained settings is challenging due to the high computational cost of large teacher models and the poor zero-shot performance—and expensive fine-tuning—of small student models.
Method: We propose online In-Context Distillation (ICD), a fine-tuning-free, inference-time knowledge transfer framework. It integrates three key components: cross-modal demonstration selection, teacher test-time scaling, and uncertainty-conditioned dynamic construction of a demonstration pool, which together significantly reduce how often the teacher must be queried.
Contribution/Results: Using as little as 4% teacher-annotated data, our method boosts small-model accuracy by up to 33%, competing with the teacher's zero-shot performance and outperforming fine-tuning under the same constrained compute budget. It enables efficient, compute-budget-aware adaptation without parameter updates, making it particularly suitable for resource-limited deployment scenarios.
📝 Abstract
As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
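The online loop the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: `student_predict`, `teacher_predict`, the confidence model, and the threshold/pool-size values are all hypothetical stand-ins chosen to show the control flow (query the teacher only when the student is uncertain; majority-vote teacher samples to reduce noise; keep a bounded demonstration pool).

```python
def student_predict(example, demonstrations):
    """Hypothetical student stub: inference conditioned on in-context
    demonstrations. Returns (prediction, confidence); here confidence
    simply grows with the number of demonstrations available."""
    confidence = min(1.0, 0.4 + 0.1 * len(demonstrations))
    return f"student_answer({example})", confidence

def teacher_predict(example, n_samples=3):
    """Hypothetical teacher stub with test-time scaling: draw several
    samples and majority-vote to reduce annotation noise."""
    samples = [f"teacher_answer({example})" for _ in range(n_samples)]
    return max(set(samples), key=samples.count)

def run_icd(stream, uncertainty_threshold=0.7, pool_size=8):
    """Online in-context distillation loop: the teacher is queried only
    when student confidence falls below the threshold, and its answers
    populate a bounded demonstration pool for later inputs."""
    pool = []            # demonstration pool of (example, teacher label)
    teacher_queries = 0
    outputs = []
    for example in stream:
        pred, conf = student_predict(example, pool)
        if conf < uncertainty_threshold:
            label = teacher_predict(example)   # sparse teacher annotation
            teacher_queries += 1
            pool.append((example, label))
            pool = pool[-pool_size:]           # keep the pool bounded
            pred = label
        outputs.append(pred)
    return outputs, teacher_queries
```

In this toy run the student's confidence crosses the threshold after a few teacher-labeled demonstrations accumulate, so the teacher is consulted only for the first few inputs — mirroring how sparse annotations (a small fraction of the stream) can suffice.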