R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses object hallucination in large vision-language models (LVLMs), where a model describes objects that are not present in the input image. The authors propose Region-aware Chain-of-Verification (R-CoV), a training-free post-processing method that requires no external detection model and elicits human-like, region-level visual reasoning from the LVLM itself. R-CoV runs a six-step pipeline (initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation) that guides the model to verify its own claims against specific image regions, so that hallucinated objects can be detected and corrected at the region level. Experiments on several widely used hallucination benchmarks across multiple LVLMs show that R-CoV significantly alleviates object hallucinations.

📝 Abstract

Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.
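
The six steps named in the abstract can be read as a simple control flow. The sketch below is only an illustration of that flow, not the authors' implementation: the `ask` callable, the prompt wording, the `parse_box` helper, and the choice to pass a bounding box alongside the prompt are all assumptions made for this example, and `toy_ask` is a scripted stand-in for a real LVLM.

```python
import re
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]
# Hypothetical interface: (prompt, optional region box) -> model text.
# A real LVLM wrapper would also hold the image; that part is assumed here.
AskFn = Callable[[str, Optional[Box]], str]


def parse_box(text: str) -> Optional[Box]:
    """Pull the first four numbers out of the model's reply, if present."""
    nums = re.findall(r"\d*\.?\d+", text)
    if len(nums) < 4:
        return None
    x1, y1, x2, y2 = (float(n) for n in nums[:4])
    return (x1, y1, x2, y2)


def r_cov_sketch(ask: AskFn, question: str) -> dict:
    # Step 1: initial response generation.
    initial = ask(question, None)
    # Step 2: entity extraction from the initial response.
    raw = ask(f"List the objects mentioned in this text, comma-separated: {initial}", None)
    entities = [e.strip().lower() for e in raw.split(",") if e.strip()]

    verdicts = {}
    for entity in entities:
        # Step 3: coordinate generation for the claimed object.
        box = parse_box(ask(f"Give the bounding box [x1, y1, x2, y2] of the {entity}.", None))
        if box is None:
            verdicts[entity] = False  # assumption: no box means unverified
            continue
        # Step 4: region description.
        description = ask("Describe this region.", box)
        # Step 5: verification execution against the region.
        answer = ask(
            f"Region description: {description}\n"
            f"Is there a {entity} in this region? Answer yes or no.",
            box,
        )
        verdicts[entity] = answer.strip().lower().startswith("yes")

    # Step 6: final response generation, dropping rejected entities.
    rejected = [e for e, ok in verdicts.items() if not ok]
    final = initial
    if rejected:
        final = ask(f"Rewrite the text without mentioning: {', '.join(rejected)}.\nText: {initial}", None)
    return {"initial": initial, "verdicts": verdicts, "final": final}


def toy_ask(prompt: str, box: Optional[Box]) -> str:
    """Scripted stand-in for an LVLM that hallucinates a frisbee."""
    if prompt.startswith("Describe the image"):
        return "A dog sits next to a frisbee."
    if prompt.startswith("List the objects"):
        return "dog, frisbee"
    if prompt.startswith("Give the bounding box"):
        return "[0.1, 0.2, 0.5, 0.9]" if "dog" in prompt else "[0.6, 0.6, 0.8, 0.8]"
    if prompt.startswith("Describe this region"):
        return "A brown dog." if box[0] < 0.5 else "Green grass."
    if prompt.startswith("Region description"):
        return "Yes" if box[0] < 0.5 else "No"
    if prompt.startswith("Rewrite the text"):
        return "A dog sits on the grass."
    raise ValueError(f"unexpected prompt: {prompt}")


result = r_cov_sketch(toy_ask, "Describe the image.")
print(result["verdicts"])
print(result["final"])
```

In this toy run the scripted model confirms the dog and rejects the frisbee, so the final response is regenerated without it. Every call goes through the same `ask` function, which mirrors the abstract's point that the method needs no training and no external detector.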

Problem

Research questions and friction points this paper is trying to address.

object hallucinations

large vision-language models

multimodal understanding

visual reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Region-aware Chain-of-Verification

Object Hallucination

Large Vision-Language Models

Post-hoc Verification