🤖 AI Summary
This work addresses "multi-image reasoning hallucination" in the spatiotemporal reasoning of vision-language models (VLMs): accuracy diverges sharply between forward and reverse temporal queries, which indicates reliance on superficial cues rather than genuine causal understanding. To mitigate this, the authors propose a progressive training framework. The model first undergoes supervised pre-training on a new chain-of-thought (CoT) dataset designed for spatiotemporal reasoning, which instills structured logical inference; it is then fine-tuned on weakly labeled data to improve generalization. The approach improves backbone accuracy and narrows the performance gap between forward and reverse queries from over 70% to 6.53%, indicating a better grasp of dynamic causal relationships, reduced temporal bias, and greater robustness.
📝 Abstract
Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is "multi-image reasoning hallucination", where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the forward-backward performance gap from over 70% to only 6.53%. This confirms the method's ability to develop authentic dynamic reasoning and reduce the inherent temporal biases of current VLMs.
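
The forward-backward performance gap can be illustrated with a minimal Python sketch. It assumes the gap is the absolute difference, in percentage points, between accuracy on forward queries and accuracy on reverse queries; the abstract does not state the exact definition. The `QueryResult` record and both function names are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class QueryResult:
    """One evaluated temporal query (hypothetical record layout)."""
    direction: str  # "forward" or "reverse"
    correct: bool   # whether the model's answer matched the reference


def accuracy(results: Iterable[QueryResult], direction: str) -> float:
    """Return the accuracy, in percent, over queries of one direction."""
    subset: List[QueryResult] = [r for r in results if r.direction == direction]
    if not subset:
        raise ValueError(f"no results for direction {direction!r}")
    return 100.0 * sum(r.correct for r in subset) / len(subset)


def forward_reverse_gap(results: List[QueryResult]) -> float:
    """Assumed definition: |forward accuracy - reverse accuracy| in points."""
    return abs(accuracy(results, "forward") - accuracy(results, "reverse"))


# Toy data, not results from the paper: 3 of 4 forward queries correct,
# 1 of 4 reverse queries correct.
toy_results = (
    [QueryResult("forward", c) for c in (True, True, True, False)]
    + [QueryResult("reverse", c) for c in (True, False, False, False)]
)
toy_gap = forward_reverse_gap(toy_results)
```

Under this assumed definition, a model that answers both directions equally well has a gap of zero regardless of its absolute accuracy, so the gap should be read alongside accuracy rather than in place of it.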