Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from object hallucination, generating text inconsistent with image content, which undermines their reliability. While existing inference-time intervention methods mitigate this issue, they require multiple forward passes, incurring prohibitive computational overhead and hindering low-latency deployment. To address this, we propose an efficient decoding regulation framework that operates in a single forward pass: it extracts visual evidence directly from self-attention layers, introduces context-activated residual direction vectors, and employs a Bayesian-inspired adaptive gating mechanism to inject token-wise residual correction signals. Evaluated on standard hallucination benchmarks, including POPE and CHAIR, our method achieves performance comparable to state-of-the-art approaches while adding negligible inference latency, easing the accuracy-efficiency trade-off and enabling reliable, real-time LVLM inference without sacrificing fidelity.

📝 Abstract
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually grounded generation. RUDDER is built on two key innovations: (1) the Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass; and (2) a Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without significantly compromising efficiency.
Problem

Research questions and friction points this paper is trying to address.

Mitigating object hallucination in Large Vision-Language Models
Reducing computational overhead of existing hallucination interventions
Balancing reliability and efficiency for real-world LVLM deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

RUDDER framework reduces hallucination with low overhead
CARD vector extracts visual evidence from residual updates
Adaptive gate injects token-wise corrective signals
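The core decoding-time mechanism described above can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's implementation: the function name `adaptive_residual_steering` and the cosine-based gating heuristic are hypothetical stand-ins, and the CARD extraction itself (reading the residual update out of a self-attention layer) is not reproduced here; the direction vector is simply passed in.

```python
import numpy as np

def adaptive_residual_steering(hidden, card_vector, alpha=1.0):
    """Steer one token's hidden state toward a visual-evidence direction.

    hidden:      (d,) residual-stream activation for the current token.
    card_vector: (d,) per-sample visual direction (stand-in for the
                 paper's CARD vector; extraction is not shown).
    alpha:       maximum injection strength.
    """
    v = card_vector / (np.linalg.norm(card_vector) + 1e-8)
    h_unit = hidden / (np.linalg.norm(hidden) + 1e-8)

    # Illustrative gate: the more the token deviates from the visual
    # direction (low cosine similarity), the stronger the correction.
    # The paper's gate is Bayesian-inspired; this linear map is a proxy.
    cos = float(h_unit @ v)
    gate = alpha * (1.0 - cos) / 2.0  # in [0, alpha]

    # Inject a correction scaled to the hidden state's own magnitude.
    return hidden + gate * np.linalg.norm(hidden) * v
```

A token already aligned with the visual direction receives no correction (gate is zero at cosine 1), while an off-context token is nudged toward the CARD direction, all within a single forward pass.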
Authors

Zhengtao Zou, Aalto University
Ya Gao, Doctoral student, Aalto University (Natural Language Processing; Machine Learning for Health)
Jiarui Guan, Aalto University
Bin Li, Shenzhen Institutes of Advanced Technology
Pekka Marttinen, Aalto University (Statistical machine learning; Computational biology)