HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language segmentation models frequently exhibit visual hallucinations, segmenting objects that are nonexistent or have been altered, yet existing evaluation protocols focus solely on label/text hallucinations and neglect visual context manipulation, hindering the diagnosis of grounding failures. Method: We introduce the first counterfactual visual reasoning benchmark for segmentation hallucinations, comprising 1,340 image-level counterfactual instance pairs, together with a dedicated evaluation framework, Visual Grounding Fidelity (VGF), that quantifies grounding accuracy under visual perturbations. Contribution/Results: Our analysis reveals that vision-driven hallucinations are substantially more prevalent than label hallucinations: state-of-the-art models persistently output their original segmentation masks even after semantic edits to the input images, exposing severe grounding deficiencies. Crucially, conventional metrics significantly underestimate such failures, demonstrating the necessity of counterfactual visual evaluation for robust segmentation assessment.

📝 Abstract
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1,340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.
Problem

Research questions and friction points this paper is trying to address.

Evaluates segmentation hallucinations in vision-language models
Addresses lack of visual context manipulation in current protocols
Measures hallucination sensitivity under coherent scene edits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual visual reasoning benchmark
Novel dataset with 1,340 counterfactual instance pairs
New metrics for hallucination sensitivity
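The core failure mode the benchmark probes is mask persistence: a model keeps segmenting the original object even after that object has been edited out of the image. A minimal sketch of how such a sensitivity check might be scored is below; the function names and the plain IoU formulation are illustrative assumptions, not the paper's actual VGF metric definition.

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def hallucination_score(pred_mask_on_edited: np.ndarray,
                        original_object_mask: np.ndarray) -> float:
    """Illustrative score, not the paper's VGF: if the target object was
    removed in the counterfactual image, any overlap between the model's
    prediction on the edited image and the object's original mask
    indicates a vision-driven hallucination (1.0 = full persistence,
    0.0 = no hallucinated overlap)."""
    return iou(pred_mask_on_edited, original_object_mask)
```

Under this sketch, a model that ignores the edit and reproduces its original mask scores 1.0, while a correctly grounded model that predicts an empty mask for the removed object scores 0.0.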