Decoupling Vision and Language: Codebook Anchored Visual Adaptation

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited performance of large vision-language models on domain-specific visual tasks and the poor transferability of existing adaptation methods, which tightly couple the vision encoder with the language model. To overcome these limitations, the authors propose CRAFT, a lightweight and parameter-efficient approach that decouples the vision encoder from the language model through a shared discrete codebook. This design maps continuous visual features into a stable token space without modifying the pre-trained language model, enabling seamless compatibility across diverse language architectures and eliminating the need for repeated alignment. Evaluated on ten domain-specific benchmarks, CRAFT achieves an average improvement of 13.51% over state-of-the-art continuous-token-based methods while fully preserving the original capabilities of the language model.

📝 Abstract
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but these encoders often underperform on domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model and lead to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, an approach that still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
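The abstract describes anchoring continuous visual features to a shared discrete codebook. A minimal sketch of that anchoring step, assuming a standard vector-quantization-style nearest-neighbour lookup (the function and variable names here are illustrative, not the paper's API):

```python
import numpy as np

def quantize_to_codebook(features, codebook):
    """Map continuous feature vectors to their nearest codebook entries.

    features: (N, D) array of continuous visual features from the encoder.
    codebook: (K, D) array of shared discrete code vectors.
    Returns (indices, quantized): discrete token ids and the anchored vectors.
    """
    # Squared Euclidean distance between every feature and every code entry,
    # computed via broadcasting: (N, 1, D) - (1, K, D) -> (N, K, D).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # nearest code id per feature
    quantized = codebook[indices]    # anchored representation in token space
    return indices, quantized

# Toy example: 4 features and a 3-entry codebook in a 2-D space.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 2))
features = rng.normal(size=(4, 2))
ids, anchored = quantize_to_codebook(features, codebook)
```

Because downstream language models consume only the discrete ids (or the fixed code vectors they index), any encoder fine-tuned against the same codebook can be swapped in without re-aligning the language side.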
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Domain Adaptation
Visual Representation
Model Decoupling
Medical Image Diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

codebook
decoupling
visual adaptation
domain-specific
discrete representation