Decoupling Vision and Language: Codebook Anchored Visual Adaptation

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited performance of large vision-language models on domain-specific visual tasks and the poor transferability of existing adaptation methods, which tightly couple the vision encoder with the language model. To overcome these limitations, the authors propose CRAFT, a lightweight and parameter-efficient approach that decouples the vision encoder from the language model through a shared discrete codebook. This design maps continuous visual features into a stable token space without modifying the pre-trained language model, enabling seamless compatibility across diverse language architectures and eliminating the need for repeated alignment. Evaluated on ten domain-specific benchmarks, CRAFT achieves an average improvement of 13.51% over state-of-the-art continuous-token-based methods while fully preserving the original capabilities of the language model.

📝 Abstract
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but these encoders often underperform on domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model and lead to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, an approach that still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
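The abstract describes anchoring continuous visual features to a shared discrete codebook. A minimal sketch of that anchoring step, assuming a standard vector-quantization-style nearest-neighbour lookup (the function and variable names here are illustrative, not the paper's API):

```python
import numpy as np

def quantize_to_codebook(features, codebook):
    """Map continuous feature vectors to their nearest codebook entries.

    features: (N, D) array of continuous visual features from the encoder.
    codebook: (K, D) array of shared discrete code vectors.
    Returns (indices, quantized): discrete token ids and the anchored vectors.
    """
    # Squared Euclidean distance between every feature and every code entry,
    # computed via broadcasting: (N, 1, D) - (1, K, D) -> (N, K, D).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # nearest code id per feature
    quantized = codebook[indices]    # anchored representation in token space
    return indices, quantized

# Toy example: 4 features and a 3-entry codebook in a 2-D space.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 2))
features = rng.normal(size=(4, 2))
ids, anchored = quantize_to_codebook(features, codebook)
```

Because downstream language models consume only the discrete ids (or the fixed code vectors they index), any encoder fine-tuned against the same codebook can be swapped in without re-aligning the language side.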
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Domain Adaptation
Visual Representation
Model Decoupling
Medical Image Diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

codebook
decoupling
visual adaptation
domain-specific
discrete representation