🤖 AI Summary
Multimodal large language models (MLLMs) face a trade-off between visual representation quality and computational efficiency. Method: We propose the "Law of Vision Representation," which asserts that MLLM performance is jointly determined by cross-modal alignment and correspondence in the vision representation, quantified via the Alignment and Correspondence (AC) score. Empirical analysis across eight benchmarks shows a strong linear correlation (R² > 0.95) between the AC score and model performance. Leveraging this law, we introduce an efficient adaptation strategy that identifies the optimal vision representation without finetuning the language model each time, reducing training cost by 99.7%. Contribution/Results: Through systematic evaluation of thirteen vision representation settings, we demonstrate that the AC score effectively guides the selection of efficient visual architectures and lightweight adaptation. This work establishes an interpretable, reusable theoretical framework and a practical paradigm for low-cost, high-performance visual representation design in MLLMs.
📝 Abstract
We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment and correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated with model performance. Leveraging this relationship, we can identify and train only the optimal vision representation, without finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
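The selection procedure implied by the abstract can be sketched as a simple linear fit: given AC scores and measured performance for a handful of vision representation settings, fit a linear model and rank new candidates by predicted performance instead of finetuning the language model for each one. This is a minimal illustrative sketch, not the paper's implementation; the AC values, performance numbers, and candidate names below are hypothetical placeholders, and the actual AC score computation follows the paper's definition.

```python
import numpy as np

# Hypothetical AC scores and measured benchmark performance for a few
# vision representation settings (illustrative placeholder values).
ac_scores = np.array([0.62, 0.70, 0.75, 0.81, 0.88])
perf = np.array([51.0, 55.5, 58.2, 62.1, 66.4])

# Fit the linear relation perf ≈ a * AC + b with ordinary least squares.
a, b = np.polyfit(ac_scores, perf, deg=1)

# R^2 of the fit: how well the AC score explains performance.
pred = a * ac_scores + b
ss_res = np.sum((perf - pred) ** 2)
ss_tot = np.sum((perf - perf.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Rank unseen candidate settings by predicted performance, avoiding a
# full language-model finetune per candidate (names are hypothetical).
candidates = {"clip_vit_l": 0.84, "dino_v2": 0.79, "clip_plus_dino": 0.90}
best = max(candidates, key=lambda name: a * candidates[name] + b)
```

Only the single top-ranked setting then needs full training, which is where the claimed 99.7% cost reduction comes from: the linear predictor replaces exhaustive per-setting finetuning.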