🤖 AI Summary
Multimodal large language models (MLLMs) face a trade-off between visual representation quality and computational efficiency. Method: We propose the "Law of Vision Representation," which asserts that MLLM performance is jointly determined by cross-modal alignment and correspondence in the vision representation, quantified via the Alignment and Correspondence (AC) score. Empirical analysis across eight benchmarks shows a strong linear correlation (R² > 0.95) between the AC score and model performance. Leveraging this law, we introduce an efficient adaptation strategy that identifies the optimal vision representation without finetuning the language model each time, reducing training cost by 99.7%. Contribution/Results: Through systematic evaluation of thirteen vision representation settings, we demonstrate that the AC score effectively guides the selection of efficient visual architectures and lightweight adaptation. This work establishes an interpretable, reusable theoretical framework and a practical paradigm for low-cost, high-performance visual representation design in MLLMs.
📝 Abstract
We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment and correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated with model performance. Leveraging this relationship, we can identify and train only the optimal vision representation, without finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
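The selection procedure implied by the abstract can be sketched as a simple linear fit: given AC scores and measured performance for a handful of vision representation settings, fit a linear model and rank new candidates by predicted performance instead of finetuning the language model for each one. This is a minimal illustrative sketch, not the paper's implementation; the AC values, performance numbers, and candidate names below are hypothetical placeholders, and the actual AC score computation follows the paper's definition.

```python
import numpy as np

# Hypothetical AC scores and measured benchmark performance for a few
# vision representation settings (illustrative placeholder values).
ac_scores = np.array([0.62, 0.70, 0.75, 0.81, 0.88])
perf = np.array([51.0, 55.5, 58.2, 62.1, 66.4])

# Fit the linear relation perf ≈ a * AC + b with ordinary least squares.
a, b = np.polyfit(ac_scores, perf, deg=1)

# R^2 of the fit: how well the AC score explains performance.
pred = a * ac_scores + b
ss_res = np.sum((perf - pred) ** 2)
ss_tot = np.sum((perf - perf.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Rank unseen candidate settings by predicted performance, avoiding a
# full language-model finetune per candidate (names are hypothetical).
candidates = {"clip_vit_l": 0.84, "dino_v2": 0.79, "clip_plus_dino": 0.90}
best = max(candidates, key=lambda name: a * candidates[name] + b)
```

Only the single top-ranked setting then needs full training, which is where the claimed 99.7% cost reduction comes from: the linear predictor replaces exhaustive per-setting finetuning.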