TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) frequently generate hallucinations, outputs that are factually incorrect or lack visual grounding, during vision-language reasoning, undermining their reliability. To address this, we propose a novel DPO-based training framework grounded in MinMax optimization, which formulates preference learning as a token-level adaptive distribution-calibration problem under semantic constraints. Our method explicitly models alignment uncertainty to strengthen causal visual-linguistic associations and mitigate overfitting to spurious preference patterns, integrating direct preference optimization (DPO), semantic consistency constraints, and dynamic token-wise weighting. Evaluated on multiple benchmarks, our approach reduces the hallucination rate from 26.4% to 13.2% and the cognitive bias score from 2.5 to 0.4 using only 4.8K preference samples, outperforming standard DPO and matching the reliability of GPT-4o.

📝 Abstract
Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
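The abstract's reformulation of DPO as a min-max problem can be written schematically as follows; the notation (policy \(\pi_\theta\), frozen reference \(\pi_{\text{ref}}\), semantically constrained perturbation set \(\Delta_{\text{sem}}\)) is assumed for illustration and is not taken verbatim from the paper:

```latex
% Standard DPO preference loss over chosen (y^+) and rejected (y^-) responses:
\mathcal{L}_{\text{DPO}}(\theta) =
  -\log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)}
    - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)}
  \right)

% Min-max reformulation: the inner max simulates alignment uncertainty via
% token-level distributional shifts \delta constrained to \Delta_{\text{sem}};
% the outer min fits the policy under these worst-case perturbations:
\min_{\theta} \;
  \max_{\delta \in \Delta_{\text{sem}}} \;
  \mathbb{E}_{(x,\, y^+,\, y^-)}
  \left[ \mathcal{L}_{\text{DPO}}(\theta;\, x,\, y^+,\, y^-,\, \delta) \right]
```

The inner maximization plays the role of an adversary over token-level shifts, so the outer minimization cannot latch onto superficial linguistic cues in the fixed preference data.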
Problem

Research questions and friction points this paper is trying to address.

Reduces hallucinations in multimodal large language models
Overcomes overfitting to static preference supervision signals
Improves grounding in causally relevant visual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-adaptive preference strategy for DPO
Min-max optimization for alignment uncertainty
Semantic constraints to preserve causal grounding
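The three innovation bullets above can be illustrated with a minimal numerical sketch. The code below is a hedged toy implementation, not the paper's method: `minmax_dpo_step` stands in for the inner maximization by searching over bounded random token-weight perturbations (a simple proxy for the paper's semantically constrained distributional shifts), while `dpo_loss` is a token-weighted variant of the standard DPO sigmoid loss. All function names, the epsilon-ball constraint, and the candidate-sampling scheme are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             weights, beta=0.1):
    """Token-weighted DPO loss: per-token log-ratios against the frozen
    reference model are scaled by adaptive weights, then aggregated into
    the usual preference sigmoid."""
    margin_chosen = weights * (logp_chosen - ref_chosen)
    margin_rejected = weights * (logp_rejected - ref_rejected)
    return -np.log(sigmoid(beta * (margin_chosen.sum() - margin_rejected.sum())))

def minmax_dpo_step(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    beta=0.1, epsilon=0.05, n_candidates=8, seed=0):
    """Inner max of the min-max objective: search bounded token-weight
    perturbations for the worst-case loss; an outer optimizer would then
    descend on this value. The epsilon ball is a toy stand-in for the
    paper's semantic constraints."""
    rng = np.random.default_rng(seed)
    n = len(logp_chosen)
    # Include the unperturbed (uniform) weighting as a baseline candidate.
    worst = dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                     np.ones(n), beta)
    for _ in range(n_candidates):
        # Perturb uniform token weights within the epsilon ball (constraint).
        w = 1.0 + rng.uniform(-epsilon, epsilon, size=n)
        loss = dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                        w, beta)
        worst = max(worst, loss)
    return worst
```

Training on the worst-case loss rather than the nominal one is what distinguishes this objective from standard DPO: the model must remain preferred under any admissible perturbation, which discourages fitting spurious token-level patterns.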
Authors

Kejia Zhang (Department of Artificial Intelligence, Xiamen University)
Keda Tao (Westlake University)
Zhiming Luo (Xiamen University)
Chang Liu (AWS AI Lab, Amazon)
Jiasheng Tang (DAMO Academy, Alibaba Group)
Huan Wang (School of Engineering, Westlake University)