🤖 AI Summary
Multimodal large language models (MLLMs) frequently hallucinate during vision-language reasoning, producing outputs that are factually incorrect or lack visual grounding, which undermines their reliability. To address this, we propose TARS, a DPO-based training framework grounded in min-max optimization, which formulates preference learning as a token-level adaptive distribution calibration problem under semantic constraints. Our method explicitly models alignment uncertainty to strengthen causal visual-linguistic associations and mitigate overfitting to spurious preference patterns. It integrates direct preference optimization (DPO), semantic consistency constraints, and dynamic token-wise weighting. Evaluated on multiple hallucination benchmarks, our approach reduces the hallucination rate from 26.4% to 13.2% and the cognition value from 2.5 to 0.4 using only 4.8k preference samples and no expert feedback. It outperforms standard DPO and matches GPT-4o on several key metrics.
📝 Abstract
Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
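For readers unfamiliar with the baseline, the following is a minimal sketch of the standard DPO loss that TARS takes as its starting point. It is not the TARS objective: the abstract does not specify how the token-level perturbations or semantic constraints are computed, so that step is deliberately left out. The function name `dpo_loss`, its argument names, and the default `beta` value are illustrative assumptions, and the inputs are assumed to be per-token log-probabilities already produced by a policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is a list of per-token log-probabilities for a response:
    the preferred ("chosen") and dispreferred ("rejected") response, scored
    by the policy being trained and by a frozen reference model.
    `beta` scales the implicit reward; 0.1 is only a placeholder value.
    """
    # Sequence log-probability is the sum of token log-probabilities.
    chosen_logratio = sum(policy_chosen) - sum(ref_chosen)
    rejected_logratio = sum(policy_rejected) - sum(ref_rejected)

    # Margin between the implicit rewards of the two responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # evaluated in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: toy log-probabilities for a three-token chosen and rejected response.
reference = [-1.0, -1.0, -1.0]
loss_at_init = dpo_loss(reference, reference, reference, reference)
loss_after_shift = dpo_loss([-0.5, -0.5, -0.5], [-2.0, -2.0, -2.0],
                            reference, reference)
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the chosen response. A min-max variant in the spirit of the abstract would, under this reading, perturb the token-level terms within a constrained set before this loss is minimized.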