Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address hallucinations (context omission, conflation, and misinterpretation) that arise in multimodal large language models (MLLMs) from cross-modal misalignment during multi-image understanding, this paper proposes CcDPO, a hierarchical preference optimization framework. The method applies Direct Preference Optimization (DPO) at two granularities: the *context level*, which corrects biases in how the model reads the full image sequence using low-cost global sequence preferences, and the *needle level*, which aligns responses with region-level visual detail through region-targeted visual prompts and multimodal preference supervision. The paper also constructs MultiScope-42k, an automatically generated dataset of multi-level preference pairs. Experiments show that the approach reduces hallucinations in multi-image scenarios and yields consistent gains on both single- and multi-image general-purpose benchmarks.

📝 Abstract
Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods using Direct Preference Optimization (DPO) constrain optimization to a solitary image reference within the input sequence, neglecting holistic context modeling. We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming into visual clues, from sequential context to local details. It features: (i) Context-Level Optimization: Re-evaluates cognitive biases underlying MLLMs' multi-image context comprehension and integrates a spectrum of low-cost global sequence preferences for bias mitigation. (ii) Needle-Level Optimization: Directs attention to fine-grained visual details through region-targeted visual prompts and multimodal preference supervision. To support scalable optimization, we also construct MultiScope-42k, an automatically generated dataset with high-quality multi-level preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks.
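
For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO loss, followed by one plausible way to combine a context-level and a needle-level term. The abstract does not state the paper's actual objective, so `two_level_loss`, `needle_weight`, and the simple weighted sum are assumptions made purely for illustration, as is the default `beta=0.1`.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) in a numerically stable form (softplus(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def two_level_loss(context_pairs, needle_pairs, needle_weight=1.0, beta=0.1):
    """Hypothetical combination of the two levels (an assumption, not the paper's formula).

    Each pair is a 4-tuple of log-probabilities in the argument order of dpo_loss.
    """
    def mean_loss(pairs):
        if not pairs:
            return 0.0
        return sum(dpo_loss(*p, beta=beta) for p in pairs) / len(pairs)

    return mean_loss(context_pairs) + needle_weight * mean_loss(needle_pairs)
```

When the policy and the reference agree on both responses, the margin is zero and the per-pair loss equals log 2; the loss falls as the policy raises the chosen response relative to the rejected one.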

Problem

Research questions and friction points this paper is trying to address.

MLLMs struggle with multi-image understanding due to cross-modal misalignment
Existing methods neglect holistic context modeling in multi-image tasks
Proposed CcDPO enhances multi-image perception via hierarchical preference optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical multi-level preference optimization framework
Context-to-Cue DPO for multi-image understanding
Automated dataset with multi-level preference pairs
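
To make "multi-level preference pairs" concrete, here is a minimal sketch of how one record might be represented. The schema of MultiScope-42k is not described on this page, so every field name (`level`, `image_paths`, `region`, and so on) and the `split_by_level` helper are hypothetical; the only ideas taken from the abstract are that pairs exist at a context level and a needle level, and that needle-level pairs involve a targeted image region.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class PreferencePair:
    """Hypothetical record layout; not the dataset's documented schema."""
    level: str                      # "context" or "needle"
    image_paths: List[str]          # the multi-image input sequence
    prompt: str
    chosen: str                     # preferred response
    rejected: str                   # dispreferred (e.g. hallucinated) response
    # Assumed (x0, y0, x1, y1) box marking the targeted region; needle level only.
    region: Optional[Tuple[int, int, int, int]] = None

    def __post_init__(self):
        if self.level not in ("context", "needle"):
            raise ValueError(f"unknown level: {self.level!r}")
        if self.level == "needle" and self.region is None:
            raise ValueError("needle-level pairs need a region")

def split_by_level(pairs: List[PreferencePair]) -> Dict[str, List[PreferencePair]]:
    """Group records so each optimization level can be batched separately."""
    groups: Dict[str, List[PreferencePair]] = {"context": [], "needle": []}
    for pair in pairs:
        groups[pair.level].append(pair)
    return groups
```

Keeping the level as an explicit field lets a single dataset feed both optimization stages without duplicating the image sequences.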
Authors

Xudong Li
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

Mengdan Zhang
Tencent Youtu Lab

Peixian Chen
Tencent Youtu Lab

Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning, Network Compression, Neural Architecture Search, AutoML

Yan Zhang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

Jingyuan Zheng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

Yunhang Shen
Tencent Youtu Lab

Ke Li
Tencent Youtu Lab

Chaoyou Fu
Nanjing University
Multimodal LLM, LLM, Biometrics

Xing Sun
Tencent Youtu Lab
LLM, MLLM, Agent

Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China