🤖 AI Summary
Existing chain-of-thought (CoT) methods in multimodal large language models (MLLMs) struggle with dynamic spatial reasoning tasks because of limited dynamic perception and weak cross-frame state tracking. To address this, we propose Dynamic Draft-Augmented Reasoning (D2R), a training-free framework that enables joint spatiotemporal reasoning by generating dynamic visual drafts in real time, overlaying them onto the input frames, and integrating them with textual CoT. Our key contributions are: (i) the first training-free paradigm unifying visual drafting with textual CoT; (ii) GRASSLAND, the first benchmark dedicated to dynamic spatial reasoning; and (iii) effective cross-frame spatial state modeling and zero-shot navigation reasoning. Experiments demonstrate that D2R significantly outperforms baselines on GRASSLAND, generalizes well to other tasks, and enhances MLLMs' dynamic spatial understanding without fine-tuning. Both the code and the GRASSLAND benchmark are publicly released.
📝 Abstract
While chain-of-thought (CoT) prompting has advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains and often falter in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on the input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts in MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. The project is available at https://github.com/Cratileo/D2R.