Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

📅 2025-06-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing spatial reasoning approaches rely predominantly on text-centric paradigms, which limits their capacity to model precise geometric relationships and continuous spatial trajectories. To address this, the paper proposes a "drawing to reason in space" paradigm that lets vision-language models express and track spatial relations directly in the visual space through elementary drawing operations, such as annotating bounding boxes and drawing auxiliary lines, without depending on external perception tools. The capability is cultivated with a three-stage training framework: cold-start training on synthetic data, reflective rejection sampling, and reinforcement learning. The resulting model, VILASR, achieves an average 18.4% improvement over existing methods across diverse benchmarks, including maze navigation, static spatial reasoning, video-based spatial reasoning, and multi-view spatial reasoning.

📝 Abstract
As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
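The abstract describes reasoning traces that interleave textual thoughts with elementary drawing operations (bounding box annotations and auxiliary lines). The paper does not publish its action format here, so the sketch below is purely illustrative: the class and field names (`DrawBox`, `DrawLine`, `ReasoningTrace`, etc.) are hypothetical stand-ins for whatever representation VILASR actually uses, shown only to make the interleaved think/draw structure concrete.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the two elementary drawing operations the paper
# names: annotating bounding boxes and drawing auxiliary lines.
# All names and structures here are illustrative, not the authors' API.

@dataclass
class DrawBox:
    label: str
    xyxy: tuple  # (x1, y1, x2, y2) in image pixel coordinates

@dataclass
class DrawLine:
    start: tuple  # (x, y)
    end: tuple    # (x, y)

@dataclass
class ReasoningTrace:
    """Interleaves textual thinking steps with drawing actions on the image."""
    steps: list = field(default_factory=list)

    def think(self, text):
        self.steps.append(("think", text))

    def draw(self, op):
        self.steps.append(("draw", op))

    def drawings(self):
        # Collect only the visual-manipulation steps from the trace.
        return [op for kind, op in self.steps if kind == "draw"]

# Toy trace for a static spatial-reasoning question about two objects.
trace = ReasoningTrace()
trace.think("Locate both chairs before comparing their positions.")
trace.draw(DrawBox(label="chair A", xyxy=(40, 60, 120, 180)))
trace.draw(DrawBox(label="chair B", xyxy=(200, 70, 280, 190)))
trace.think("Connect the box centers to judge the left-right relation.")
trace.draw(DrawLine(start=(80, 120), end=(240, 130)))

print(len(trace.drawings()))  # → 3
```

In an actual system the drawing actions would be rendered back onto the image and fed to the model as updated visual input at each step; the trace structure above only captures the interleaving itself.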
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial reasoning in vision-language models
Overcoming text-centric limitations in geometric understanding
Enabling visual drawing for spatial relationship analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enables LVLMs to reason through visual drawing operations
Uses synthetic data and reinforcement learning for training
Improves spatial reasoning with direct visual manipulation
Junfei Wu
Institute of Automation, Chinese Academy of Sciences
Multimodal Reasoning · Large Vision-Language Model · Fake News Detection
Jian Guan
Ant Group
Kaituo Feng
MMLab, CUHK
Multimodal LLMs · Machine Learning
Qiang Liu
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Shu Wu
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Liang Wang
Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Wei Wu
Ant Group
Tieniu Tan
Institute of Automation, Chinese Academy of Sciences