Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models (LVLMs) frequently generate visually ungrounded hallucinations during multimodal reasoning and lack human-like pre-reasoning mechanisms such as conceptual clarification and salient-point summarization. To address this, we propose Re-Critic, a rationality-enhanced framework built around a "critique-first" paradigm: visual rationales are first generated and self-critiqued, and chain-of-thought (CoT) reasoning follows. Re-Critic explicitly integrates rationale synthesis, CoT, and in-context self-critique into the instruction-tuning pipeline, and leverages contrastive preference optimization to improve the visual grounding of model responses. Empirical results demonstrate substantial, robust gains in hallucination suppression and strong generalization to general multimodal reasoning tasks, achieving state-of-the-art performance across multiple benchmarks.
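The summary mentions contrastive preference optimization over self-critiqued response pairs. The paper does not publish its exact objective here, but a common instantiation of this idea is a DPO-style loss that pushes the policy to prefer the well-grounded response over the hallucinated one relative to a frozen reference model. A minimal sketch, with `beta` and all log-probability inputs as assumed placeholders rather than the paper's actual formulation:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style preference loss on one (chosen, rejected) response pair.

    The margin measures how much more the policy prefers the chosen
    (visually grounded) response over the rejected (hallucinated) one,
    relative to the reference model; the loss is -log(sigmoid(beta * margin)).
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss is `-log(0.5) ≈ 0.693`; as the policy widens the gap in favor of the grounded response, the loss decreases.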

📝 Abstract
Despite significant advancements in multimodal reasoning tasks, existing Large Vision-Language Models (LVLMs) are prone to producing visually ungrounded responses when interpreting associated images. In contrast, when humans embark on learning new knowledge, they often rely on a set of fundamental pre-study principles: reviewing outlines to grasp core concepts, and summarizing key points to guide their focus and enhance understanding. However, such preparatory actions are notably absent in current instruction tuning processes. This paper presents Re-Critic, an easily scalable rationale-augmented framework designed to incorporate fundamental rules and chain-of-thought (CoT) as a bridge to enhance reasoning abilities. Specifically, Re-Critic develops a visual rationale synthesizer that scalably augments raw instructions with rationale explanations. To probe more contextually grounded responses, Re-Critic employs an in-context self-critic mechanism to select response pairs for preference tuning. Experiments demonstrate that models fine-tuned with our rationale-augmented dataset yield gains that extend beyond hallucination-specific tasks to broader multimodal reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

LVLMs produce visually ungrounded responses in multimodal reasoning
Current instruction tuning lacks the preparatory steps humans use when learning (e.g., reviewing outlines, summarizing key points)
Need scalable framework to enhance reasoning with rationale-augmented tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rationale-augmented framework enhances reasoning abilities
Visual rationale synthesizer scales explanation augmentation
In-context self-critic mechanism selects response pairs
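The in-context self-critic mechanism above selects response pairs for preference tuning. One plausible reading of that step: score each candidate response with a critic (e.g., a visual-grounding rating) and take the best- and worst-scored candidates as the (chosen, rejected) pair. A minimal sketch, where `critic_score` is a hypothetical scoring callable, not the paper's actual critic:

```python
def select_preference_pair(candidates, critic_score):
    """Pick a (chosen, rejected) response pair for preference tuning.

    candidates: list of candidate response strings.
    critic_score: callable mapping a response to a grounding score
                  (higher = better grounded; assumed placeholder).
    """
    # Rank candidates by the critic's score; the top-scored response becomes
    # "chosen" and the bottom-scored one "rejected".
    ranked = sorted(candidates, key=critic_score, reverse=True)
    return ranked[0], ranked[-1]
```

In practice the critic would itself be the LVLM prompted in-context to judge its own candidates; the stub above only fixes the selection logic.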
Zexian Yang
Institute of Information Engineering, Chinese Academy of Sciences; Foundation Technology Center, Tencent PCG
Dian Li
Tencent.com
MLLM · video understanding · self-supervised learning · vision-language
Dayan Wu
Institute of Information Engineering, Chinese Academy of Sciences
Gang Liu
Foundation Technology Center, Tencent PCG
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network · Network Security