🤖 AI Summary
This work identifies a critical norm imbalance between visual and textual tokens in multimodal large language models (MLLMs), induced by pre-normalization (Pre-Norm) architectures. The imbalance leads to asymmetric update dynamics and representational inertia in visual tokens, severely hindering cross-modal feature fusion. We establish, for the first time, that norm disparity is a fundamental mechanism limiting fusion efficiency. To address it, we propose a minimalist intervention: inserting a single LayerNorm module immediately after the visual projection layer to align token norms across modalities. The method requires no backbone modification and no additional training overhead. Evaluated on mainstream architectures (e.g., LLaVA-1.5), it consistently improves multimodal understanding on benchmarks such as MMBench and OCRBench, while also boosting pure-language accuracy (e.g., +1.2% on MMLU). These results demonstrate its transferability, computational efficiency, and capacity to enhance holistic representation learning.
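The norm imbalance described above can be made concrete with a small diagnostic. The sketch below (not the authors' code; tensor names, shapes, and scales are illustrative assumptions) measures the ratio of mean visual-token norm to mean text-token norm at the language model's input:

```python
import torch

def norm_disparity(visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> float:
    """Ratio of mean visual-token L2 norm to mean text-token L2 norm.

    visual_tokens: (num_visual_tokens, hidden_dim)
    text_tokens:   (num_text_tokens, hidden_dim)
    A ratio far above 1.0 indicates the imbalance the paper describes.
    """
    v = visual_tokens.norm(dim=-1).mean()
    t = text_tokens.norm(dim=-1).mean()
    return (v / t).item()

# Toy illustration: "visual" tokens with inflated norms vs. unit-scale
# "text" tokens (LLaVA-like shapes: 576 patch tokens, 4096-dim hidden).
vis = torch.randn(576, 4096) * 10.0
txt = torch.randn(32, 4096)
ratio = norm_disparity(vis, txt)  # roughly 10 for this synthetic data
```

In a real MLLM the tensors would be the projected visual features and the text embeddings entering the first transformer block, captured e.g. with forward hooks.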
📝 Abstract
Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders with language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between high-norm visual tokens and low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," in which high-norm visual tokens exhibit a "representational inertia" that causes them to transform semantically much more slowly than their textual counterparts, fundamentally impairing effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic (the persistence of norm disparity and the resulting asymmetric update rates) is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
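The proposed intervention amounts to a one-module change. The following is a minimal sketch under stated assumptions (class and variable names are hypothetical, and the default LayerNorm initialization stands in for the paper's "carefully initialized" variant): a LayerNorm applied to the visual projector's output before the tokens enter the language model.

```python
import torch
from torch import nn

class NormAlignedProjector(nn.Module):
    """Wraps an existing visual projector with a trailing LayerNorm so that
    projected visual tokens match the norm scale of text embeddings."""

    def __init__(self, projector: nn.Module, hidden_dim: int):
        super().__init__()
        self.projector = projector
        # Default init (weight=1, bias=0) is a placeholder assumption; the
        # paper specifies a careful initialization of this layer.
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        return self.norm(self.projector(visual_features))

# Usage with a stand-in linear projector (LLaVA-1.5-like dims: 1024 -> 4096).
proj = NormAlignedProjector(nn.Linear(1024, 4096), hidden_dim=4096)
tokens = proj(torch.randn(1, 576, 1024))  # (batch, patches, hidden_dim)
```

After this wrapper, every visual token has a near-constant norm of about sqrt(hidden_dim) at initialization, the same scale LayerNorm imposes on text embeddings inside a Pre-Norm transformer, which is what makes the fix backbone-agnostic.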