Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chain-of-thought (CoT) methods in multimodal large language models (MLLMs) struggle with dynamic spatial reasoning tasks due to limited dynamic perception and cross-frame state tracking capabilities. To address this, we propose Draft-Augmented Reasoning (D2R), a training-free framework that enables spatiotemporal joint reasoning by generating real-time dynamic visual sketches—overlaid onto input frames—and synergistically integrating them with textual CoT. Our key contributions include: (i) the first training-free paradigm unifying visual sketching and textual CoT; (ii) GRASSLAND, the first benchmark dedicated to dynamic spatial reasoning; and (iii) effective cross-frame spatial state modeling and zero-shot navigation reasoning. Experiments demonstrate that D2R significantly outperforms baselines on GRASSLAND, exhibits strong generalization, and enhances MLLMs’ dynamic spatial understanding without fine-tuning. Both code and the GRASSLAND benchmark are publicly released.
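To make the "dynamic visual sketch" idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of what overlaying a draft onto a frame could look like for a toy maze task: the agent's trajectory so far is drawn onto each frame so the model can condition on it alongside its textual CoT. The grid encoding, symbols, and function name are illustrative assumptions.

```python
# Hypothetical illustration of a "visual draft": overlay the agent's tracked
# trajectory onto a maze frame ('#' = wall, '.' = free cell).
def overlay_draft(grid, path):
    """Return a copy of `grid` (list of strings) with the traversed path
    drawn as '*' and the agent's current position marked 'A'."""
    canvas = [list(row) for row in grid]
    for r, c in path[:-1]:          # cells already visited
        if canvas[r][c] == '.':
            canvas[r][c] = '*'
    r, c = path[-1]                 # current agent position
    canvas[r][c] = 'A'
    return ["".join(row) for row in canvas]

frame = ["....",
         ".##.",
         "....",
         "...."]
path = [(0, 0), (1, 0), (2, 0), (2, 1)]  # cross-frame spatial state so far

draft = overlay_draft(frame, path)
print("\n".join(draft))
```

In an actual MLLM pipeline, the analogous step would rasterize such a draft onto the input image before each reasoning round, so the visual and textual chains evolve together.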

📝 Abstract
While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. The project is open-sourced at https://github.com/Cratileo/D2R.
Problem

Research questions and friction points this paper is trying to address.

Bridging dynamic perception gap in multimodal reasoning
Enhancing spatial reasoning with visual-textual drafts
Training-free framework for dynamic spatial tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic visual drafts augment textual reasoning chains
Training-free D2R integrates CoT with visual drafts
GRASSLAND benchmark evaluates dynamic spatial reasoning
👥 Authors

Siqu Ou — Shanghai Jiao Tong University
Hongcheng Liu — Shanghai Jiao Tong University
Pingjie Wang — Shanghai Jiao Tong University (Model Compression, Inference Acceleration)
Yusheng Liao — Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory (Large Language Models, Clinical NLP, Agent, Reasoning)
Chuan Xuan — Shanghai Jiao Tong University
Yanfeng Wang — Shanghai Jiao Tong University
Yu Wang — Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory