🤖 AI Summary
This work identifies a critical norm imbalance between visual and textual tokens in multimodal large language models (MLLMs), induced by pre-normalization (Pre-Norm) architectures. The imbalance leads to asymmetric update dynamics and representational inertia in visual tokens, severely hindering cross-modal feature fusion. We establish, for the first time, that norm disparity is a fundamental mechanism limiting fusion efficiency. To address it, we propose a minimalist intervention: inserting a single LayerNorm module immediately after the visual projection layer to align token norms across modalities. The method requires no backbone modification and no additional training overhead. Evaluated on mainstream architectures (e.g., LLaVA-1.5), it consistently improves multimodal understanding on benchmarks such as MMBench and OCRBench, while also boosting pure-language accuracy (e.g., +1.2% on MMLU). These results demonstrate its transferability, computational efficiency, and capacity to enhance holistic representation learning.
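The norm imbalance described above can be made concrete with a small diagnostic. The sketch below (not the authors' code; tensor names, shapes, and scales are illustrative assumptions) measures the ratio of mean visual-token norm to mean text-token norm at the language model's input:

```python
import torch

def norm_disparity(visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> float:
    """Ratio of mean visual-token L2 norm to mean text-token L2 norm.

    visual_tokens: (num_visual_tokens, hidden_dim)
    text_tokens:   (num_text_tokens, hidden_dim)
    A ratio far above 1.0 indicates the imbalance the paper describes.
    """
    v = visual_tokens.norm(dim=-1).mean()
    t = text_tokens.norm(dim=-1).mean()
    return (v / t).item()

# Toy illustration: "visual" tokens with inflated norms vs. unit-scale
# "text" tokens (LLaVA-like shapes: 576 patch tokens, 4096-dim hidden).
vis = torch.randn(576, 4096) * 10.0
txt = torch.randn(32, 4096)
ratio = norm_disparity(vis, txt)  # roughly 10 for this synthetic data
```

In a real MLLM the tensors would be the projected visual features and the text embeddings entering the first transformer block, captured e.g. with forward hooks.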
📝 Abstract
Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders with language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between high-norm visual tokens and low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," in which high-norm visual tokens exhibit a "representational inertia" that causes them to transform semantically much more slowly than their textual counterparts, fundamentally impairing effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic (the persistence of norm disparity and the resulting asymmetric update rates) is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
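The proposed intervention amounts to a one-module change. The following is a minimal sketch under stated assumptions (class and variable names are hypothetical, and the default LayerNorm initialization stands in for the paper's "carefully initialized" variant): a LayerNorm applied to the visual projector's output before the tokens enter the language model.

```python
import torch
from torch import nn

class NormAlignedProjector(nn.Module):
    """Wraps an existing visual projector with a trailing LayerNorm so that
    projected visual tokens match the norm scale of text embeddings."""

    def __init__(self, projector: nn.Module, hidden_dim: int):
        super().__init__()
        self.projector = projector
        # Default init (weight=1, bias=0) is a placeholder assumption; the
        # paper specifies a careful initialization of this layer.
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        return self.norm(self.projector(visual_features))

# Usage with a stand-in linear projector (LLaVA-1.5-like dims: 1024 -> 4096).
proj = NormAlignedProjector(nn.Linear(1024, 4096), hidden_dim=4096)
tokens = proj(torch.randn(1, 576, 1024))  # (batch, patches, hidden_dim)
```

After this wrapper, every visual token has a near-constant norm of about sqrt(hidden_dim) at initialization, the same scale LayerNorm imposes on text embeddings inside a Pre-Norm transformer, which is what makes the fix backbone-agnostic.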