TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

📅 2025-07-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models (MLLMs) frequently hallucinate during vision-language reasoning, producing factually incorrect outputs that lack visual grounding, which undermines their reliability. To address this, we propose TARS, a DPO-based training framework grounded in min-max optimization, which formulates preference learning as a token-level adaptive distribution calibration problem under semantic constraints. Our method explicitly models alignment uncertainty to strengthen causal visual-linguistic associations and mitigate overfitting to spurious preference patterns. It integrates direct preference optimization (DPO), semantic consistency constraints, and dynamic token-wise weighting. Evaluated on multiple hallucination benchmarks, our approach reduces the hallucination rate from 26.4% to 13.2% and the cognition value from 2.5 to 0.4 using only 4.8k preference samples, outperforming standard DPO and matching GPT-4o on several key metrics.

📝 Abstract

Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
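For orientation, the objective described in the abstract can be written out roughly as follows. The first line is the standard DPO loss, where x is the image and prompt, y_w and y_l are the preferred and rejected responses, π_ref is the frozen reference model, and β is a temperature. The second and third lines are only a schematic reading of the abstract's min-max description: the symbols δ (a token-level perturbation), D (a measure of distributional shift), and C (the set of semantically admissible perturbations) are placeholders assumed here, and the paper's exact formulation may differ.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

% Schematic inner step (assumed notation): largest admissible token-level shift
\delta^{\star} = \arg\max_{\delta \in \mathcal{C}} \; D(\delta)

% Schematic outer step (assumed notation): minimize the preference loss under that shift
\min_{\theta} \; \mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
  \ell_{\mathrm{DPO}}\!\left(\theta;\, x,\, y_w,\, y_l,\, \delta^{\star}\right)
\right]
```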

Problem

Research questions and friction points this paper is trying to address.

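MLLMs hallucinate, producing plausible outputs that are factually incorrect or visually ungrounded

Existing DPO strategies overfit to static preference supervision signals

Spurious correlations with superficial linguistic cues impair grounding in causally relevant visual information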

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-adaptive preference strategy for DPO

Min-max optimization for alignment uncertainty

Semantic constraints to preserve causal grounding
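As a rough illustration of where token-level adaptation could enter a DPO objective, the sketch below computes the standard DPO loss for a single preference pair from per-token log-probabilities, with optional per-token weights. The weights, the function names (`sequence_logp`, `dpo_loss`), and the default `beta` are assumptions made for illustration; this is not the TARS implementation, which perturbs token-level distributions under semantic constraints in a way this page does not specify.

```python
import math
from typing import Optional, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sequence_logp(token_logps: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of per-token log-probabilities (uniform weights by default)."""
    if weights is None:
        weights = [1.0] * len(token_logps)
    if len(weights) != len(token_logps):
        raise ValueError("weights and token_logps must have the same length")
    return sum(w * lp for w, lp in zip(weights, token_logps))


def dpo_loss(policy_chosen: Sequence[float],
             policy_rejected: Sequence[float],
             ref_chosen: Sequence[float],
             ref_rejected: Sequence[float],
             beta: float = 0.1,
             chosen_weights: Optional[Sequence[float]] = None,
             rejected_weights: Optional[Sequence[float]] = None) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    The per-token weights are a hypothetical hook for token-adaptive
    schemes; with weights=None this reduces to plain DPO.
    """
    # Log-ratio of policy to reference for each response.
    chosen_margin = (sequence_logp(policy_chosen, chosen_weights)
                     - sequence_logp(ref_chosen, chosen_weights))
    rejected_margin = (sequence_logp(policy_rejected, rejected_weights)
                       - sequence_logp(ref_rejected, rejected_weights))
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    ref_c, ref_l = [-1.0, -2.0], [-1.5, -2.5]
    print(dpo_loss(ref_c, ref_l, ref_c, ref_l))
    # Policy that prefers the chosen response more than the reference does.
    print(dpo_loss([-0.5, -1.0], [-2.0, -3.0], ref_c, ref_l))
```

When the policy matches the reference, the loss is log 2; it falls below that as the policy raises the chosen response relative to the rejected one. In a min-max scheme like the one the abstract describes, an inner step would adjust the token-level quantities adversarially within a constraint set before this loss is minimized.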

🔎 Similar Papers

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models