🤖 AI Summary
Existing chain-of-thought (CoT) methods in multimodal large language models (MLLMs) struggle with dynamic spatial reasoning tasks because of limited dynamic perception and weak cross-frame state tracking. To address this, we propose Dynamic Draft-Augmented Reasoning (D2R), a training-free framework that enables joint spatiotemporal reasoning by generating dynamic visual drafts in real time, overlaying them onto the input frames, and integrating them with textual CoT. Our key contributions are: (i) the first training-free paradigm unifying visual drafting with textual CoT; (ii) GRASSLAND, the first benchmark dedicated to dynamic spatial reasoning; and (iii) effective cross-frame spatial state modeling and zero-shot navigation reasoning. Experiments demonstrate that D2R significantly outperforms baselines on GRASSLAND, generalizes well to other tasks, and enhances MLLMs' dynamic spatial understanding without fine-tuning. Both the code and the GRASSLAND benchmark are publicly released.
📝 Abstract
While chain-of-thought (CoT) prompting has advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains and often falter in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on the input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts in MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. The project is available at https://github.com/Cratileo/D2R.