🤖 AI Summary
This work addresses the tendency of current vision-language models to rely on spurious correlations in visual question answering, a reliance that obscures whether failures in causal reasoning stem from insufficient reasoning capacity or from misidentification of causally relevant information. To disentangle these factors, the authors propose the Vision-Language Causal Graph (VLCG), a structured, query-conditioned representation that explicitly models causally relevant objects, attributes, relations, and scene assumptions. They further introduce the ViLCaR diagnostic benchmark, featuring a novel graph alignment metric that decouples the identification of causal information from downstream reasoning ability. Experiments demonstrate that incorporating VLCG structures significantly improves causal attribution accuracy and reasoning consistency in mainstream models, outperforming both zero-shot and standard in-context learning approaches. These findings suggest that the primary bottleneck lies not in inherent reasoning capability but in the absence of explicit structural guidance.
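To make the representation concrete, here is a minimal sketch of how a query-conditioned VLCG could be encoded. The summary does not publish the paper's actual schema, so the class names, field names, and the example instance below are illustrative assumptions; only the four component types (objects, attributes, relations, scene assumptions) come from the description above.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the paper's exact VLCG format is not given here,
# so all names below are illustrative assumptions.

@dataclass(frozen=True)
class Relation:
    subject: str    # e.g. "wet_road"
    predicate: str  # e.g. "reduces"
    obj: str        # e.g. "traction"

@dataclass
class VLCG:
    query: str                                       # question the graph is conditioned on
    objects: list = field(default_factory=list)      # causally relevant entities
    attributes: dict = field(default_factory=dict)   # entity -> attribute list
    relations: list = field(default_factory=list)    # causal edges between entities
    assumptions: list = field(default_factory=list)  # scene-grounded assumptions

# Hypothetical instance for the query "Why did the car skid?"
graph = VLCG(
    query="Why did the car skid?",
    objects=["car", "road"],
    attributes={"road": ["wet"], "car": ["moving"]},
    relations=[
        Relation("wet_road", "reduces", "traction"),
        Relation("low_traction", "causes", "skid"),
    ],
    assumptions=["recent rain made the road surface wet"],
)
```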
📝 Abstract
Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess answer correctness, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising Causal Attribution, Causal Inference, and Question Answering tasks, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.
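The abstract does not specify the graph-aligned metrics; one plausible variant is triple-level F1 between a model's predicted graph and a reference graph, which scores whether the model identified the causally relevant structure independently of its final answer. The function name and triple representation below are assumptions for illustration.

```python
def triple_f1(predicted: set[tuple[str, str, str]],
              reference: set[tuple[str, str, str]]) -> float:
    """F1 over (subject, predicate, object) triples.

    Hypothetical graph-alignment metric: measures identification of
    causally relevant edges, decoupled from final answer accuracy.
    """
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)  # exactly matching triples
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model recovers one of two gold causal edges.
pred = {("wet_road", "reduces", "traction")}
gold = {("wet_road", "reduces", "traction"),
        ("low_traction", "causes", "skid")}
print(round(triple_f1(pred, gold), 3))  # 0.667
```

Under this reading, a model could answer a question correctly while scoring low on alignment (right answer via a spurious shortcut), or vice versa, which is exactly the decoupling the benchmark is designed to expose.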