AI Summary
Existing multimodal reasoning models lack a real-time, trustworthy visual attribution mechanism, making it difficult to verify whether their reasoning genuinely relies on semantically relevant regions of the input image. This work proposes an amortized causal attribution framework that uses the rich signals embedded in attention features to estimate, in real time, the causal effect of semantic image regions on model outputs, and streams the resulting visual attributions during inference. For the first time in multimodal reasoning models, this approach simultaneously achieves causal faithfulness and real-time performance without requiring repeated backpropagation or input perturbations. Experiments across five benchmarks and four state-of-the-art models show that the method achieves attribution quality comparable to exhaustive causal baselines while letting users observe the model's decision rationale interactively, as the model reasons.
Abstract
We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps are available instantly but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.
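To make the amortization idea concrete, here is a minimal sketch under loudly stated assumptions; it is not the authors' implementation. The toy scorer, the synthetic "attention features", the helper names (`ablation_effects`, `fit_ridge`, `make_image`), and the choice of ridge regression are all hypothetical stand-ins. The sketch computes per-region causal effects once with an exhaustive ablation loop, then fits a lightweight estimator that predicts those effects from features alone, so no perturbation is needed at inference time.

```python
import numpy as np

def ablation_effects(score_fn, regions):
    """Exhaustive baseline: zero out each region in turn and record the score drop."""
    full = score_fn(regions)
    effects = np.empty(len(regions))
    for i in range(len(regions)):
        masked = regions.copy()
        masked[i] = 0.0  # remove region i
        effects[i] = full - score_fn(masked)
    return effects

def fit_ridge(features, targets, lam=1e-3):
    """Closed-form ridge regression, used here as the lightweight amortized estimator."""
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

def make_image(rng, w_true, mix, n_regions=16, noise=0.05):
    """Return (hypothetical attention features, ablation effects) for one synthetic image."""
    regions = rng.normal(size=(n_regions, w_true.size))
    score_fn = lambda r: float((r @ w_true).sum())  # toy stand-in for the model output
    effects = ablation_effects(score_fn, regions)
    # Assumption: the features are a noisy, mixed view of the region content.
    features = regions @ mix + noise * rng.normal(size=regions.shape)
    return features, effects

rng = np.random.default_rng(0)
dim = 8
w_true = rng.normal(size=dim)
mix, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # well-conditioned feature mixing

# Offline: pay the exhaustive ablation cost once to build training targets.
train = [make_image(rng, w_true, mix) for _ in range(200)]
X = np.vstack([f for f, _ in train])
y = np.concatenate([e for _, e in train])
weights = fit_ridge(X, y)

# Online: one matrix-vector product per image, no perturbations.
test_features, test_effects = make_image(rng, w_true, mix)
predicted = test_features @ weights
corr = float(np.corrcoef(predicted, test_effects)[0, 1])
print(f"correlation with ablation effects on a held-out image: {corr:.3f}")
```

In this toy setting the expensive loop runs only while building training targets; at inference the estimate costs a single dot product per region, which is what makes streaming plausible. How the actual framework defines regions, features, and the estimator is not specified in the abstract.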