🤖 AI Summary
This study addresses the faithfulness of visual reasoning in multimodal large language models (MLLMs): specifically, whether their decisions genuinely rely on the input image rather than on linguistic priors or memorized knowledge. To this end, we introduce VFaith-Bench, the first causally grounded benchmark for evaluating visual faithfulness, comprising 755 carefully curated items. We propose a quantitative paradigm termed *visual faithfulness*, featuring a cue-driven automated image editing pipeline to generate contrastive question-answer pairs sensitive to visual perturbations. Faithfulness is measured via performance-difference-based attribution metrics (e.g., Δ-Acc, VisRel Score), supported by GPT-Image-1-assisted editing and human perceptual validation. Comprehensive evaluation across leading closed- and open-source MLLMs reveals consistently low visual faithfulness, exposing pervasive reliance on memorized patterns and a fundamental disconnect between model perception and actual visual content.
📄 Abstract
Extensive recent work has demonstrated that introducing long chain-of-thought (CoT) reasoning can effectively enhance the ability of MLLMs to solve complex problems. However, the reasons for the effectiveness of this paradigm remain unclear: it is difficult to quantify how much a model's extraction of specific visual cues, and its subsequent so-called reasoning during inference, contribute to the performance gains. Evaluating the faithfulness of MLLMs' reasoning to visual information is therefore crucial. To address this issue, we first present a cue-driven, automatic, and controllable editing pipeline built on GPT-Image-1, which enables precise editing of specific visual cues according to an instruction. Building on this pipeline, we introduce VFaith-Bench, the first benchmark to evaluate MLLMs' visual reasoning capabilities and analyze the source of those capabilities, with an emphasis on visual faithfulness. Using the pipeline, we construct contrastive question-answer pairs by altering the visual cues in an image that are crucial for solving the original reasoning problem, thereby changing the question's answer. By testing similar questions on images that differ in these details, the average accuracy reflects the model's visual reasoning ability, while the difference in accuracy before and after editing the test-set images reveals the relationship between the model's reasoning and its visual perception; we further design specific metrics to expose this relationship. VFaith-Bench contains 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. We conduct in-depth testing and analysis of mainstream flagship models and prominent open-source model series and reasoning models on VFaith-Bench, further investigating the underlying factors of their reasoning capabilities.
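The accuracy-difference idea above can be sketched concretely. The snippet below is a minimal illustration, not the paper's official implementation; the function names (`accuracy`, `delta_acc`) and the toy data are hypothetical, and the exact definitions of Δ-Acc and the VisRel Score in VFaith-Bench may differ.

```python
def accuracy(preds, golds):
    """Fraction of predictions matching the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def delta_acc(orig_preds, orig_golds, edit_preds, edit_golds):
    """Accuracy drop after the answer-relevant visual cues are edited.

    A large drop can indicate that the model keeps producing the answers
    associated with the original image (e.g., memorized patterns) instead
    of re-reading the edited visual content.
    """
    return accuracy(orig_preds, orig_golds) - accuracy(edit_preds, edit_golds)

# Toy example: four paired items before/after cue editing. The edits changed
# the gold answers, but the model repeats its original predictions.
orig_preds, orig_golds = ["A", "B", "C", "D"], ["A", "B", "C", "D"]
edit_preds, edit_golds = ["A", "B", "C", "D"], ["B", "B", "D", "A"]
print(delta_acc(orig_preds, orig_golds, edit_preds, edit_golds))  # 1.0 - 0.25 = 0.75
```

In this toy case the model is perfectly accurate on the original images but only 25% accurate on the edited ones, yielding a large accuracy gap that would flag low visual faithfulness under this style of metric.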