🤖 AI Summary
To address the trade-off between model scale, inference cost, and performance in vision-language models (VLMs), this paper proposes a master-apprentice collaborative inference framework: a large “Master” model generates and caches high-quality reasoning outputs, while a small “Apprentice” model retrieves relevant cached results via multimodal retrieval and augments its inference through dynamic in-context learning (ICL). The key contribution is the novel *thought caching* mechanism—the first cache-driven paradigm for vision-language tasks—integrating multimodal retrieval, dynamic ICL, and lightweight cache management into a distillation-inspired collaborative architecture. Evaluated on mainstream VQA benchmarks under fixed computational budgets, the framework improves overall accuracy by up to 7.7%; the Apprentice model alone attains up to a 36.6% accuracy gain over its standalone counterpart, significantly outperforming both isolated small models and existing baselines.
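The cache-then-retrieve loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ThoughtCache` class, the similarity `threshold`, and the embedding inputs are all hypothetical stand-ins (a real system would embed image and question jointly with a multimodal encoder and call the Master/Apprentice VLMs).

```python
import numpy as np

class ThoughtCache:
    """Sketch of a thought cache: stores Master-model reasoning keyed by a
    multimodal query embedding, then retrieves the most similar cached
    entries to build an in-context prompt for the Apprentice model."""

    def __init__(self, threshold=0.8, k=2):
        self.keys = []              # unit-normalized query embeddings
        self.values = []            # (question, master_reasoning) pairs
        self.threshold = threshold  # minimum cosine similarity to reuse
        self.k = k                  # number of in-context examples

    def add(self, embedding, question, master_reasoning):
        """Cache one high-quality Master output under its query embedding."""
        v = np.asarray(embedding, dtype=float)
        self.keys.append(v / np.linalg.norm(v))
        self.values.append((question, master_reasoning))

    def retrieve(self, embedding):
        """Return up to k cached entries whose similarity clears the threshold."""
        if not self.keys:
            return []
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.keys) @ q             # cosine similarities
        order = np.argsort(sims)[::-1][: self.k]
        return [self.values[i] for i in order if sims[i] >= self.threshold]

def build_prompt(cache, embedding, question):
    """Prepend retrieved Master thoughts as in-context examples (dynamic ICL)."""
    examples = cache.retrieve(embedding)
    demos = "".join(f"Q: {q}\nThought: {r}\n\n" for q, r in examples)
    return f"{demos}Q: {question}\nThought:"
```

On a cache miss (no entry above the threshold), the prompt contains no demonstrations and the query would instead be routed to the Master model, whose answer is then added to the cache for future queries.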
📝 Abstract
Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scale, yet choosing the right VLM size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master-apprentice framework for collaborative inference between large and small VLMs. CoT manages high-quality query results from large VLMs (master) in a cache, which are then selected via novel multimodal retrieval and in-context learning to aid the performance of small VLMs (apprentice). We extensively evaluate CoT on various widely recognized and challenging general VQA benchmarks, and show that CoT increases overall VQA performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%.