PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Hallucination in multimodal large language models (MLLMs) demands efficient, low-overhead mitigation strategies. This paper proposes a training-free, zero-parameter-update adaptive KV cache pruning method, the first to introduce token-level pruning for hallucination suppression in MLLMs. It dynamically identifies and removes redundant visual tokens via attention-based analysis, thereby sharpening the model's focus on salient visual cues. The approach is architecture-agnostic, compatible with mainstream MLLMs (e.g., LLaVA, Qwen-VL) and diverse decoding strategies, and adds negligible inference latency. Evaluated across four representative MLLMs and multiple standard hallucination benchmarks (e.g., HalluBench, MME-Hallu), it achieves consistent reductions in hallucination rates of 12.6%–23.4% while demonstrating strong robustness. This work establishes a lightweight, plug-and-play paradigm for hallucination mitigation in MLLMs.

📝 Abstract
While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method does not require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.
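The paper's code is not yet released, but the core idea described in the abstract, scoring visual tokens by the attention they receive and dropping the least-attended ones from the KV cache, can be sketched in a few lines. The function name, the averaging-over-queries scoring rule, and the `keep_ratio` parameter below are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

def prune_visual_kv_cache(keys, values, attn_to_visual, visual_idx, keep_ratio=0.5):
    """Hypothetical sketch of attention-based visual-token KV pruning.

    keys, values    : (seq_len, head_dim) cached K/V entries for one head
    attn_to_visual  : (num_queries, num_visual) attention weights onto visual tokens
    visual_idx      : positions of visual tokens within the sequence
    keep_ratio      : fraction of visual tokens to retain
    """
    # Score each visual token by the average attention mass it receives.
    scores = attn_to_visual.mean(axis=0)
    k = max(1, int(len(visual_idx) * keep_ratio))
    # Keep the k most-attended visual tokens; all text tokens are retained.
    keep_local = np.argsort(scores)[-k:]
    keep_visual = set(np.asarray(visual_idx)[keep_local].tolist())
    visual_set = set(visual_idx)
    mask = np.array([(i not in visual_set) or (i in keep_visual)
                     for i in range(keys.shape[0])])
    return keys[mask], values[mask]
```

In practice such pruning would run once per layer after the prefill step, so subsequent decoding attends only over the retained (more informative) visual entries, which is how a method of this kind stays essentially free at inference time.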
Problem

Research questions and friction points this paper is trying to address.

Reducing hallucinations in multi-modal large language models
Addressing insufficient attention to critical visual tokens
Mitigating redundant visual tokens dispersing model focus
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive KV cache pruning reduces hallucinations
Training-free method enhances visual attention focus
Model-agnostic approach integrates with decoding strategies
Fengyuan Sun
School of Software, Tsinghua University
Hui Chen
School of Software, Tsinghua University
Xinhao Xu
School of Software, Tsinghua University
Dandan Zheng
Ant Group
Jingdong Chen
Ant Group
Jun Zhou
Ant Group
Jungong Han
Chair Professor in Computer Vision, University of Sheffield, UK, FIAPR, FAAIA
Computer Vision · Video Analytics · Machine Learning
Guiguang Ding
Tsinghua University
Computer Vision · Multimedia Retrieval