Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses hallucination in multimodal large language models (MLLMs). During decoding, attention often concentrates on seemingly irrelevant image tokens. Existing methods treat these tokens as noise and forcibly redirect attention toward key image regions; the authors argue that the tokens actually carry visual and narrative logic, so forced correction worsens the vision-language imbalance. Viewing decoding as a game, they attribute hallucination to an imbalance between linguistic priors and visual information and propose ACE (Adversarial Counter-Commonsense Equilibrium), a training-free, plug-and-play framework. ACE perturbs the visual context with counter-commonsense patches: authentic visual features stay stable under perturbation while hallucinated content fluctuates, so ACE suppresses perturbation-sensitive priors and compensates for stable visual signals to restore balance. In the authors' experiments, ACE improves model trustworthiness with negligible inference overhead.

📝 Abstract

During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention toward key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.
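
The abstract does not give ACE's actual update rule, so the following is only a toy sketch of the general idea: compare next-token scores computed with the original image against scores computed with perturbed images, penalize tokens whose scores move a lot, and give a small bonus to tokens that stay stable. The function name `rebalance_logits`, the weights `alpha` and `beta`, the mean-absolute-difference sensitivity measure, and the example numbers are all assumptions made for illustration and are not taken from the paper.

```python
import math

def rebalance_logits(clean_logits, perturbed_logits_list, alpha=1.0, beta=0.5):
    """Toy rebalancing rule (hypothetical, NOT the paper's formula).

    clean_logits: next-token logits computed with the original image.
    perturbed_logits_list: one logit vector per perturbed copy of the image.
    alpha: assumed weight of the penalty on perturbation-sensitive tokens.
    beta: assumed weight of the bonus for perturbation-stable tokens.
    """
    n = len(perturbed_logits_list)
    adjusted = []
    for i, logit in enumerate(clean_logits):
        # Sensitivity: mean absolute logit change across perturbed views.
        # Comparing raw logits directly is a simplification of this sketch.
        sensitivity = sum(abs(logit - p[i]) for p in perturbed_logits_list) / n
        # Stability lies in (0, 1]; it equals 1 when the token never moves.
        stability = math.exp(-sensitivity)
        adjusted.append(logit - alpha * sensitivity + beta * stability)
    return adjusted

def greedy_pick(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=lambda i: scores[i])

if __name__ == "__main__":
    # Invented numbers: token 1 is favored with the clean image but
    # collapses once the visual context is perturbed.
    vocab = ["dog", "frisbee", "cat"]
    clean = [2.0, 2.2, 0.1]
    perturbed = [[2.0, 0.5, 0.1], [1.9, 0.3, 0.2]]
    print("clean pick:", vocab[greedy_pick(clean)])
    print("rebalanced pick:", vocab[greedy_pick(rebalance_logits(clean, perturbed))])
```

In this sketch the only extra cost is one forward pass per perturbed image, and no weights are updated, which is consistent with the training-free, plug-and-play framing. How the real method builds its counter-commonsense patches and keeps the overhead negligible is not specified in the abstract.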

Problem

Research questions and friction points this paper is trying to address.

multimodal hallucination

vision-language imbalance

attention misalignment

MLLM decoding

visual-linguistic equilibrium

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Counter-Commonsense Equilibrium

vision-language balance

hallucination suppression