🤖 AI Summary
Multimodal large language models (MLLMs) frequently exhibit behavioral unfaithfulness, where reasoning steps contradict the final output, and perceptual unfaithfulness, where reasoning diverges from the visual input, leading to hallucinations and unstable inference. The paper explicitly distinguishes these two dimensions of faithfulness and proposes FaithAct, a planning-and-acting framework centered on evidence anchoring: it grounds each reasoning step in visual evidence, introduces step-level and chain-level faithfulness evaluation, and establishes FaithEval, a quantitative, multi-granular benchmark for faithfulness assessment. Evaluated across multiple multimodal reasoning benchmarks, FaithAct improves perceptual faithfulness by up to 26% without sacrificing task accuracy, significantly mitigating hallucination and stabilizing reasoning trajectories. Key contributions include (1) a two-dimensional model of faithfulness, (2) an evidence-driven reasoning control framework, and (3) a quantifiable, multi-granular faithfulness evaluation system.
📝 Abstract
Unfaithfulness remains a persistent challenge for multimodal large language models (MLLMs), which often produce plausible yet ungrounded reasoning chains that diverge from perceptual evidence or final conclusions. We distinguish between behavioral faithfulness (alignment between reasoning and output) and perceptual faithfulness (alignment between reasoning and input), and introduce FaithEval for quantifying step-level and chain-level faithfulness by evaluating whether each claimed object is visually supported by the image. Building on these insights, we propose FaithAct, a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step. Experiments across multiple reasoning benchmarks demonstrate that FaithAct improves perceptual faithfulness by up to 26% without degrading task accuracy compared to prompt-based and tool-augmented baselines. Our analysis shows that treating faithfulness as a guiding principle not only mitigates hallucination but also leads to more stable reasoning trajectories. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning.
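The scoring idea behind FaithEval, as described above, can be sketched in a few lines: a step's faithfulness is the fraction of objects it claims that are actually supported by the image, and a chain's faithfulness aggregates over its steps. The sketch below is a minimal illustration under those assumptions; the function names are hypothetical, object extraction and visual verification are stubbed with plain sets, and the paper's actual aggregation may differ.

```python
# Hypothetical sketch of step- and chain-level faithfulness scoring.
# `detected_objects` stands in for whatever visual grounding module
# (e.g. an object detector) confirms is present in the image.

def step_faithfulness(claimed_objects, detected_objects):
    """Fraction of objects claimed in one reasoning step that are visually supported."""
    if not claimed_objects:
        return 1.0  # a step that claims nothing is vacuously faithful
    supported = sum(1 for obj in claimed_objects if obj in detected_objects)
    return supported / len(claimed_objects)

def chain_faithfulness(steps, detected_objects):
    """Mean step-level faithfulness over a whole reasoning chain."""
    if not steps:
        return 1.0
    scores = [step_faithfulness(step, detected_objects) for step in steps]
    return sum(scores) / len(scores)

# Toy example: the detector finds a dog, a frisbee, and grass in the image,
# but step 2 of the reasoning chain hallucinates a "cat".
detected = {"dog", "frisbee", "grass"}
chain = [["dog", "frisbee"], ["cat", "grass"]]
print(step_faithfulness(chain[1], detected))  # 0.5: one of two claims supported
print(chain_faithfulness(chain, detected))    # 0.75: mean of 1.0 and 0.5
```

A faithfulness-first controller in the spirit of FaithAct could then gate each step on this score, revising or discarding steps whose claims lack visual support before acting on them.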