🤖 AI Summary
This work addresses the susceptibility of large vision-language models to object hallucination when linguistic priors conflict with visual evidence. To mitigate this issue, the authors propose CHASD, a training-free, on-demand calibration framework applied during inference. The approach uses an uncertainty-driven confidence gate that triggers contrastive decoding only at low-confidence generation steps. At those steps, it applies attention-guided local perturbations to the currently salient visual tokens to construct a negative decoding branch, which is contrasted against the original branch to suppress hallucinated outputs. Evaluated on POPE, AMBER, MME, MMHal-Bench, and CHAIR, the method improves hallucination-related metrics over strong training-free baselines while keeping inference efficiency competitive.
📝 Abstract
Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risk is transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum next-token probability falls below a threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution at high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
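The gating logic described in the abstract can be illustrated with a minimal sketch of one decoding step. This is not the authors' implementation. The abstract does not give the contrastive formula, the perturbation operator, or any hyperparameter values, so the following are all assumptions: the combination rule `(1 + alpha) * original - alpha * negative` (a form commonly used in contrastive decoding), zeroing the top-`k` attended visual tokens as the "localized perturbation", and the names `forward`, `tau`, `alpha`, and `k`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Token = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perturb_salient(visual_tokens: Sequence[Token],
                    attention: Sequence[float], k: int) -> List[Token]:
    """Return a copy with the k most-attended visual tokens zeroed out.

    Zeroing is an assumed stand-in for the paper's localized perturbation.
    """
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    salient = set(order[:k])
    return [[0.0] * len(tok) if i in salient else list(tok)
            for i, tok in enumerate(visual_tokens)]


def gated_contrastive_step(forward: Callable[[Sequence[Token]], List[float]],
                           visual_tokens: Sequence[Token],
                           attention: Sequence[float],
                           tau: float = 0.9, alpha: float = 1.0,
                           k: int = 1) -> Tuple[List[float], bool]:
    """One decoding step with a confidence gate.

    `forward` is a hypothetical callable mapping visual tokens to next-token
    logits. Returns (next-token distribution, whether the negative branch ran).
    """
    logits = forward(visual_tokens)
    probs = softmax(logits)
    # Confidence gate: a high-confidence step keeps the original distribution
    # and skips the second forward pass entirely.
    if max(probs) >= tau:
        return probs, False
    # Low-confidence step: build the negative branch from perturbed input.
    neg_logits = forward(perturb_salient(visual_tokens, attention, k))
    # Assumed contrastive rule: amplify what the intact image supports and
    # penalize what survives even when the salient evidence is removed.
    contrast = [(1.0 + alpha) * lo - alpha * ln
                for lo, ln in zip(logits, neg_logits)]
    return softmax(contrast), True
```

In a real model, `forward` would wrap the vision-language model's forward pass and `attention` would come from its attention over visual tokens at the current step. The returned flag makes the efficiency argument concrete: the negative-branch forward pass is paid only on steps where the gate fires.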