HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration

📅 2025-12-03
🤖 AI Summary
Problem: In safety-critical image restoration (e.g., medical imaging, remote sensing), generative models often produce semantic hallucinations, spurious yet plausible structures that compromise system reliability, and evaluating them currently depends on costly, subjective human annotation. Method: We propose HalluGen, a diffusion-based framework that synthesizes hallucinations with controllable type, spatial location, and severity; construct the first large-scale hallucination dataset (4,350 annotated images derived from 1,450 brain MR images); design SHAFE, a feature-based evaluation metric with soft-attention pooling; and train reference-free hallucination detectors. Results: Synthesized hallucinations reduce segmentation IoU from 0.86 to 0.36; SHAFE is more sensitive to hallucinations than traditional image quality metrics; and the detectors generalize to real restoration failures. Our work establishes a benchmark platform for hallucination assessment, detection, and mitigation in safety-critical vision systems.

📝 Abstract
Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors. Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective. We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36). Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation. We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures. Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.
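The abstract quantifies semantic incorrectness as a drop in segmentation IoU (0.86 to 0.36). As a reminder of what that number measures, here is a minimal sketch of intersection-over-union for two binary masks. The function name `iou` and the convention of returning 1.0 for two empty masks are assumptions made for illustration; the paper's segmentation model and exact evaluation protocol are not described in this summary.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection-over-union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    intersection = np.logical_and(pred, true).sum()
    return float(intersection) / float(union)

# Toy example: the prediction covers one extra pixel that is absent
# from the reference mask, so the overlap is 1 pixel out of 2.
reference = [[1, 0], [0, 0]]
predicted = [[1, 1], [0, 0]]
score = iou(predicted, reference)
```

A hallucinated structure changes the segmentation of the restored image relative to the ground truth, which shrinks the intersection or grows the union and lowers this ratio even when the image still looks perceptually plausible.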
Problem

Research questions and friction points this paper is trying to address.

Generating realistic hallucinations to evaluate image restoration models
Addressing lack of labeled data for hallucination assessment in critical domains
Enabling systematic evaluation and detection of hallucinations in medical imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based framework synthesizes realistic controllable hallucinations
Large-scale hallucination dataset enables systematic evaluation and detection
Feature-based metric SHAFE improves hallucination sensitivity over traditional metrics
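The summary describes SHAFE only as a feature-based metric with soft-attention pooling, without giving its formula. The sketch below is therefore not SHAFE itself; it only illustrates, under stated assumptions, why attention-style pooling can be more sensitive to small localized errors than plain averaging. It assumes a hypothetical per-location feature-distance map `dist_map` is already available and pools it with softmax weights controlled by a hypothetical `temperature` parameter.

```python
import numpy as np

def soft_attention_pool(dist_map, temperature=0.1):
    """Pool a distance map with softmax weights (illustrative, not SHAFE).

    Locations with larger distances receive larger weights, so a small
    region of strong disagreement dominates the pooled score.
    """
    d = np.asarray(dist_map, dtype=float).ravel()
    logits = d / temperature
    logits -= logits.max()            # subtract max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()          # softmax weights sum to 1
    return float(np.sum(weights * d))

# Hypothetical distance map: zero everywhere except a 2x2 patch,
# mimicking a small, localized hallucination.
dist_map = np.zeros((16, 16))
dist_map[4:6, 4:6] = 1.0

mean_score = float(dist_map.mean())          # diluted by the clean background
soft_score = soft_attention_pool(dist_map)   # dominated by the 2x2 patch
```

With mean pooling the four affected pixels are averaged against 252 clean ones, while the softmax weighting concentrates almost all of the weight on the affected patch. On a uniform map the two pooling schemes give the same value.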