When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

📅 2025-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large multimodal models (LMMs) produce semantic hallucinations when reading ambiguous, corrupted, or distorted scene text, largely because strong semantic priors override the visual evidence. To address this, we propose a training-free hallucination-mitigation framework. First, we empirically show that Transformer layers with stronger attention focus on scene text regions hallucinate less. Second, we design ZoomText, a coarse-to-fine text localization strategy that needs no external detector. Third, we introduce Grounded Layer Correction, a mechanism that adaptively draws on the internal representations of the less hallucination-prone layers to guide decoding with explicit visual grounding. For systematic evaluation, we construct TextHalu-Bench, a benchmark dedicated to scene-text hallucination assessment. Experiments show that the method reduces error rates by 42.3% on TextHalu-Bench while maintaining strong performance on mainstream benchmarks including IC13 and COCO-Text.
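The coarse-to-fine localization idea can be illustrated with a small sketch. It assumes a 2D grid of non-negative per-patch relevance scores (for example, attention mass on each image patch). The function name `coarse_to_fine`, the block size, and the `keep_ratio` threshold are hypothetical choices for illustration, not the paper's procedure.

```python
from typing import List, Tuple

Grid = List[List[float]]


def coarse_to_fine(scores: Grid, block: int = 2,
                   keep_ratio: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) patches likely to contain text.

    Coarse step: average the scores over block x block cells, keep the best cell.
    Fine step: inside that cell, keep patches scoring at least
    keep_ratio * (maximum score in the cell). Scores are assumed non-negative.
    """
    rows, cols = len(scores), len(scores[0])
    best_cell, best_mean = (0, 0), float("-inf")
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cell = [scores[r][c]
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            mean = sum(cell) / len(cell)
            if mean > best_mean:
                best_cell, best_mean = (r0, c0), mean

    # Fine step: re-examine individual patches inside the winning cell.
    r0, c0 = best_cell
    patches = [(r, c)
               for r in range(r0, min(r0 + block, rows))
               for c in range(c0, min(c0 + block, cols))]
    cell_max = max(scores[r][c] for r, c in patches)
    return [(r, c) for r, c in patches if scores[r][c] >= keep_ratio * cell_max]
```

A real system would likely keep several candidate cells and crop or upsample the image around them; this sketch keeps a single cell to stay short.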

📝 Abstract

Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
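To make the layer-level finding concrete, the following is a minimal sketch of how attention focus could be measured per layer and used to pick a better-grounded layer. Everything here is an assumption for illustration: the function names, the definition of focus as the share of attention mass on text-region tokens, and the fixed blending weight `alpha` (the abstract describes the correction as adaptive, which this sketch does not attempt to reproduce).

```python
from typing import List, Sequence


def attention_focus(attn_row: Sequence[float], text_idx: Sequence[int]) -> float:
    """Share of one layer's attention mass that lands on text-region tokens."""
    total = sum(attn_row)
    return sum(attn_row[i] for i in text_idx) / total if total > 0 else 0.0


def grounded_correction(hidden: List[List[float]], attn: List[List[float]],
                        text_idx: Sequence[int], alpha: float = 0.5) -> List[float]:
    """Blend the final layer's hidden state with the most text-focused layer's.

    hidden[l] is layer l's hidden vector for the token being decoded;
    attn[l] is that token's attention over image tokens at layer l.
    alpha is a hypothetical fixed mixing weight.
    """
    focus = [attention_focus(row, text_idx) for row in attn]
    # Pick the layer whose attention concentrates most on the text region.
    best = max(range(len(focus)), key=focus.__getitem__)
    final = hidden[-1]
    return [(1 - alpha) * f + alpha * g for f, g in zip(final, hidden[best])]
```

In a real model the blended state would then be projected to vocabulary logits; that step is omitted because it depends on the specific architecture.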

Problem

Research questions and friction points this paper is trying to address.

Mitigating semantic hallucinations in Large Multimodal Models

Improving scene text spotting and understanding accuracy

Addressing visually ambiguous or non-semantic text challenges

Innovation

Methods, ideas, or system contributions that make the work stand out.

ZoomText identifies potential text regions with a coarse-to-fine strategy, without external detectors

Grounded Layer Correction adaptively guides decoding

TextHalu-Bench provides over 1,730 samples spanning semantic and non-semantic cases
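Because the benchmark separates semantic from non-semantic cases, results are most informative when reported per subset. A minimal sketch of such a breakdown follows; the record layout, the subset labels, and the case-sensitive exact-match criterion are assumptions for illustration, not the benchmark's official protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_subset(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """records: (subset, prediction, ground_truth) triples.

    Returns exact-match accuracy per subset (whitespace-trimmed, case-sensitive).
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for subset, pred, truth in records:
        totals[subset] += 1
        hits[subset] += int(pred.strip() == truth.strip())
    return {s: hits[s] / totals[s] for s in totals}


records = [
    ("semantic", "OPEN", "OPEN"),
    # A semantic hallucination: the model "repairs" the misspelled sign.
    ("non-semantic", "MOUNTAIN", "MOUNTAN"),
    ("non-semantic", "XQZ7", "XQZ7"),
]
print(accuracy_by_subset(records))
```

A gap between the two subsets, with non-semantic accuracy lower, is the pattern one would expect from a model that leans on semantic priors.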

🔎 Similar Papers

MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification