DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) frequently rely on question-irrelevant image regions for visual question answering (VQA), resulting in unfaithful reasoning and poor interpretability. To address this, we propose a counterfactual image reasoning framework that jointly optimizes answer accuracy and reasoning faithfulness via a three-phase training paradigm—positive examples, counterfactual examples, and random masking—combined with an automated evidence localization pipeline. We further apply GRPO (Group Relative Policy Optimization) with a multi-objective reward mechanism to align answer generation with visual grounding end to end. Evaluated on a newly constructed high-quality counterfactual VQA dataset of 100K samples, our method achieves substantial improvements across multiple benchmarks: +2.3%–4.1% in answer accuracy and +18.7% in reasoning faithfulness, while enhancing model robustness and interpretability. The code and dataset are publicly released.

📝 Abstract
Recent advances in multimodal language models (MLLMs) have achieved remarkable progress in vision-language reasoning, especially with the emergence of "thinking with images," which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, flawed reasoning indicates that the model has not truly understood the image, highlighting the critical importance of reasoning fidelity in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. A key component of our approach is the design of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, where we design three complementary rewards to guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.
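The three training variants described in the abstract—positive, counterfactual, and random-masking—can be sketched as image transformations driven by a localized evidence region. The sketch below is a minimal illustration, assuming axis-aligned bounding boxes and zero-fill masking; the helper name and box format are hypothetical, not the paper's actual pipeline code.

```python
import numpy as np

def make_variants(image, evidence_box, rng=None):
    """Build the three training variants from one image and its
    question-relevant evidence box (x0, y0, x1, y1).
    Hypothetical sketch: box format and zero-fill masking are assumptions."""
    rng = rng or np.random.default_rng()
    x0, y0, x1, y1 = evidence_box
    h, w = image.shape[:2]

    # (i) positive: keep the evidence region, mask everything else
    positive = np.zeros_like(image)
    positive[y0:y1, x0:x1] = image[y0:y1, x0:x1]

    # (ii) counterfactual: mask the evidence region, keep the context
    counterfactual = image.copy()
    counterfactual[y0:y1, x0:x1] = 0

    # (iii) random masking: mask a same-sized box at a random location
    bh, bw = y1 - y0, x1 - x0
    ry = int(rng.integers(0, max(h - bh, 1)))
    rx = int(rng.integers(0, max(w - bw, 1)))
    randomized = image.copy()
    randomized[ry:ry + bh, rx:rx + bw] = 0

    return positive, counterfactual, randomized
```

A model trained on such triplets should answer correctly on the positive variant, abstain or change its answer on the counterfactual variant, and be robust to the random mask—which is what ties answer accuracy to evidence grounding.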
Problem

Research questions and friction points this paper is trying to address.

Multimodal models rely on irrelevant image regions for reasoning
Flawed reasoning persists even when answers are correct
Models lack true image understanding despite correct responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual reasoning framework for multimodal models
Three training paradigms with automatic evidence localization
GRPO reinforcement learning with complementary reward design
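GRPO scores a group of sampled responses per prompt and normalizes each reward against the group's statistics, avoiding a learned value function. A minimal sketch of the group-relative advantage, with a hypothetical weighted combination of three complementary rewards (the exact reward terms and weights in DeFacto may differ):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO (Group Relative Policy
    Optimization): z-score each sampled response's reward against
    the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def combined_reward(answer_ok, format_ok, grounding_iou, w=(1.0, 0.5, 1.0)):
    """Hypothetical multi-objective reward: answer correctness,
    output-format compliance, and evidence-grounding overlap (IoU).
    Weights are illustrative assumptions."""
    return w[0] * float(answer_ok) + w[1] * float(format_ok) + w[2] * grounding_iou
```

Responses whose combined reward beats their group's mean receive positive advantages and are reinforced; because the baseline is the group itself, no critic network is needed.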
Tianrun Xu
Department of Automation, Tsinghua University
Haoda Jing
Department of Automation, Tsinghua University
Ye Li
Department of Computer Science and Technology, Xinjiang University
Yuquan Wei
Fuzhou University
Jun Feng
Institute of Automation, Chinese Academy of Sciences
Guanyu Chen
Department of Automation, Tsinghua University
Haichuan Gao
Department of Automation, Tsinghua University
Tianren Zhang
Tsinghua University
Representation learning · Generalization · Learning theory · Reinforcement learning · Machine learning
Feng Chen
Department of Automation, Tsinghua University