🤖 AI Summary
This work addresses the tendency of large multimodal models (LMMs) to over-rely on textual priors and neglect visual evidence during causal discovery, which biases their reasoning. To diagnose the issue, the authors propose ProCauEval, an evaluation protocol that disentangles the contributions of the visual and textual modalities through five controlled perturbation configurations, giving a mechanistic diagnosis of the deficiency rather than an accuracy score alone. They further introduce Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework that enforces visually grounded causal reasoning by pushing the policy away from a negative, prior-only teacher induced by visual corruption, separating the two policy distributions via KL divergence. Experiments across 17 mainstream LMMs show that the models perceive video content accurately yet fail to use it effectively for causal judgments. ADPO improves visual engagement without compromising fundamental comprehension.
📝 Abstract
Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis. It probes causal discovery through five controlled configurations that systematically manipulate the visual and textual modalities, decomposing their respective contributions to model behavior and dissecting the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these issues, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, offering a preliminary step toward reliable causal discovery.
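The divergence term described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact divergence, its direction, or how it is combined with the GRPO objective. The sketch assumes a KL divergence from the original-input policy to the corrupted-input policy over a small set of answer options, added to a scalar objective with a hypothetical weight `beta`. All function names are illustrative.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def anti_distillation_bonus(logits_original, logits_corrupted):
    """Divergence between the policy given the original video and the policy
    given the visually corrupted video (the assumed 'negative teacher').
    A larger value means the answer depends more on the visual input."""
    p = softmax(logits_original)
    q = softmax(logits_corrupted)
    return kl_divergence(p, q)


def shaped_objective(base_objective, logits_original, logits_corrupted, beta=0.1):
    """Hypothetical combination: base (e.g. GRPO-style) objective plus a
    weighted divergence bonus. `beta` is an assumed hyperparameter."""
    bonus = anti_distillation_bonus(logits_original, logits_corrupted)
    return base_objective + beta * bonus


if __name__ == "__main__":
    # Policy ignores the video: identical answer logits with and without it.
    print(anti_distillation_bonus([2.0, 0.5, 0.1], [2.0, 0.5, 0.1]))
    # Policy uses the video: the answer distribution shifts under corruption.
    print(anti_distillation_bonus([2.0, 0.5, 0.1], [0.1, 0.5, 2.0]))
```

Under these assumptions, a model that answers purely from the text receives no bonus, because corrupting the video leaves its answer distribution unchanged, while a model whose answer shifts under corruption is rewarded.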