🤖 AI Summary
To address the trade-off between model scale, inference cost, and performance in vision-language models (VLMs), this paper proposes a master-apprentice collaborative inference framework: a large “Master” model generates and caches high-quality reasoning outputs, while a small “Apprentice” model retrieves relevant cached results via multimodal retrieval and augments its inference through dynamic in-context learning (ICL). The key contribution is the novel *thought caching* mechanism—the first cache-driven paradigm for vision-language tasks—integrating multimodal retrieval, dynamic ICL, and lightweight cache management into a distillation-inspired collaborative architecture. Evaluated on mainstream VQA benchmarks under fixed computational budgets, the framework improves overall accuracy by up to 7.7%; the Apprentice model alone attains up to a 36.6% accuracy gain over its standalone counterpart, significantly outperforming both isolated small models and existing baselines.
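The cache-then-retrieve loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ThoughtCache` class, the similarity `threshold`, and the embedding inputs are all hypothetical stand-ins (a real system would embed image and question jointly with a multimodal encoder and call the Master/Apprentice VLMs).

```python
import numpy as np

class ThoughtCache:
    """Sketch of a thought cache: stores Master-model reasoning keyed by a
    multimodal query embedding, then retrieves the most similar cached
    entries to build an in-context prompt for the Apprentice model."""

    def __init__(self, threshold=0.8, k=2):
        self.keys = []              # unit-normalized query embeddings
        self.values = []            # (question, master_reasoning) pairs
        self.threshold = threshold  # minimum cosine similarity to reuse
        self.k = k                  # number of in-context examples

    def add(self, embedding, question, master_reasoning):
        """Cache one high-quality Master output under its query embedding."""
        v = np.asarray(embedding, dtype=float)
        self.keys.append(v / np.linalg.norm(v))
        self.values.append((question, master_reasoning))

    def retrieve(self, embedding):
        """Return up to k cached entries whose similarity clears the threshold."""
        if not self.keys:
            return []
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.keys) @ q             # cosine similarities
        order = np.argsort(sims)[::-1][: self.k]
        return [self.values[i] for i in order if sims[i] >= self.threshold]

def build_prompt(cache, embedding, question):
    """Prepend retrieved Master thoughts as in-context examples (dynamic ICL)."""
    examples = cache.retrieve(embedding)
    demos = "".join(f"Q: {q}\nThought: {r}\n\n" for q, r in examples)
    return f"{demos}Q: {question}\nThought:"
```

On a cache miss (no entry above the threshold), the prompt contains no demonstrations and the query would instead be routed to the Master model, whose answer is then added to the cache for future queries.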
📝 Abstract
Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scale, yet choosing the right VLM size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master-apprentice framework for collaborative inference between large and small VLMs. CoT manages high-quality query results from large VLMs (master) in a cache, which are then selected via novel multimodal retrieval and in-context learning to aid the performance of small VLMs (apprentice). We extensively evaluate CoT on various widely recognized and challenging general VQA benchmarks, and show that CoT increases overall VQA performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%.