Decoupled Proxy Alignment: Mitigating Language Prior Conflict for Multimodal Alignment in MLLM

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a previously overlooked issue in multimodal large language models (MLLMs): *language prior conflict*, a misalignment between the inherent linguistic priors of the base LLM and the language distribution of multimodal training data, which biases vision-language alignment. To address this, the authors propose *Decoupled Proxy Alignment (DPA)*: a lightweight proxy language model is introduced to explicitly decouple the alignment process from interference by dominant linguistic priors, and a vision-relevance-aware dynamic loss weighting mechanism amplifies gradient signals for visually relevant tokens. DPA requires no modification to the backbone architecture and is compatible with diverse MLLM pretraining paradigms. Experiments across multiple datasets, model scales, and architectures demonstrate that DPA significantly mitigates language prior conflict, yielding consistent improvements in cross-modal alignment accuracy and generalization performance.

📝 Abstract
Multimodal large language models (MLLMs) have gained significant attention due to their impressive ability to integrate vision and language modalities. Recent advancements in MLLMs have primarily focused on improving performance through high-quality datasets, novel architectures, and optimized training strategies. However, in this paper, we identify a previously overlooked issue, language prior conflict, a mismatch between the inherent language priors of large language models (LLMs) and the language priors in training datasets. This conflict leads to suboptimal vision-language alignment, as MLLMs are prone to adapting to the language style of training samples. To address this issue, we propose a novel training method called Decoupled Proxy Alignment (DPA). DPA introduces two key innovations: (1) the use of a proxy LLM during pretraining to decouple the vision-language alignment process from language prior interference, and (2) dynamic loss adjustment based on visual relevance to strengthen optimization signals for visually relevant tokens. Extensive experiments demonstrate that DPA significantly mitigates the language prior conflict, achieving superior alignment performance across diverse datasets, model families, and scales. Our method not only improves the effectiveness of MLLM training but also shows exceptional generalization capabilities, making it a robust approach for vision-language alignment. Our code is available at https://github.com/fnlp-vision/DPA.
Problem

Research questions and friction points this paper is trying to address.

Addresses language prior conflict in multimodal alignment
Mitigates mismatch between LLM priors and training data
Improves vision-language alignment in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proxy LLM decouples vision-language alignment
Dynamic loss adjustment for visual relevance
Mitigates language prior conflict in training
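The page does not give the exact form of DPA's visual-relevance-based loss adjustment; a minimal sketch of the underlying idea, with hypothetical function and parameter names (`weighted_token_loss`, `relevance`), is a per-token cross-entropy where each token's loss is scaled by its visual-relevance score so that visually grounded tokens dominate the optimization signal:

```python
def weighted_token_loss(token_nll, relevance):
    """Hypothetical sketch of vision-relevance-aware dynamic loss weighting.

    token_nll: per-token negative log-likelihoods from the language model.
    relevance: per-token visual-relevance scores (higher = more visually
               grounded); the paper's actual scoring method is not shown here.
    """
    # Normalize relevance scores into weights with mean 1 so the overall
    # loss scale is preserved while visually relevant tokens are amplified.
    mean_rel = sum(relevance) / len(relevance)
    weights = [r / mean_rel for r in relevance]
    # Weighted average of per-token losses.
    return sum(w * nll for w, nll in zip(weights, token_nll)) / len(token_nll)
```

With uniform relevance this reduces to the ordinary mean token loss; skewing relevance toward image-grounded tokens shifts gradient mass onto them, which is the stated goal of the mechanism.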
Chenkun Tan — Fudan University
Pengyu Wang — Fudan University
Shaojun Zhou — Fudan University
Botian Jiang — Fudan University
Zhaowei Li — Moonshot AI
Dong Zhang — Fudan University
Xinghao Wang — Fudan University
Yaqian Zhou — Fudan University
Xipeng Qiu — Fudan University