Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) face a significant bottleneck in dynamic spatial reasoning (DSR), the task of understanding how object geometries and 3D spatial relationships evolve over time, primarily due to the lack of scalable 4D perceptual training resources. To address this, we introduce DSR Suite: the first video-driven benchmark and dataset explicitly designed for 4D understanding, covering multi-object interaction, viewpoint transformation, and fine-grained process reasoning. We propose a lightweight, plug-and-play Geometry Selection Module (GSM) that injects question-directed 3D/4D priors into VLMs via geometric tokenization and cross-modal alignment. Furthermore, we develop an automated pipeline for generating 4D multiple-choice questions, integrating vision foundation model–driven video geometric parsing (camera poses, point clouds, trajectories, masks) with human refinement. Evaluated on Qwen2.5-VL-7B, our approach achieves substantial DSR performance gains while preserving accuracy on general video understanding.

📝 Abstract
Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across the dataset, benchmark, and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous work, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) that seamlessly integrates geometric priors into VLMs: it condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.
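To make the pipeline's question-generation step concrete, here is a minimal sketch of one hypothetical DSR question template: given an object's extracted 3D trajectory, classify its dominant horizontal motion and emit a multiple-choice item. The coordinate convention (camera-frame x to the right), the template wording, and the function name are illustrative assumptions, not the paper's actual implementation.

```python
import random
import numpy as np

def make_direction_mcq(name, trajectory, seed=0):
    """Hypothetical sketch of one DSR question template.

    `trajectory` is an (N, 3) array of an object's 3D positions in the
    camera frame (x right, y down, z forward -- an assumed convention).
    Returns a multiple-choice item keyed on the dominant motion direction.
    """
    rng = random.Random(seed)
    disp = trajectory[-1] - trajectory[0]  # net displacement over the clip
    answer = "left" if disp[0] < 0 else "right"
    options = ["left", "right", "toward the camera", "away from the camera"]
    rng.shuffle(options)  # shuffle so the correct index varies per question
    return {
        "question": f"In which direction does the {name} mainly move?",
        "options": options,
        "answer": options.index(answer),
    }
```

In the actual pipeline such templates would draw on the full set of extracted cues (camera poses, masks, orientations) and be followed by human refinement for the benchmark split.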
Problem

Research questions and friction points this paper is trying to address.

Addresses dynamic spatial reasoning in vision-language models
Generates 4D-aware training data from in-the-wild videos
Integrates geometric priors to enhance 3D temporal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline generates 4D question-answer pairs from videos
Geometry Selection Module integrates geometric priors into vision-language models
Lightweight module extracts relevant geometry tokens for dynamic reasoning
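The geometry-token idea in the bullets above can be sketched as question-conditioned cross-attention: a small set of queries, conditioned on the question embedding, attends over features from a pretrained 4D reconstruction backbone and returns a compact set of geometry tokens for the VLM. This is a minimal NumPy sketch under assumed shapes; the query initialization, conditioning scheme, and dimensions are stand-ins for learned components, not the paper's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_selection(question_emb, geom_feats, num_tokens=8, d_model=64, seed=0):
    """Hypothetical GSM sketch: question-conditioned cross-attention.

    question_emb: (L, d_model) question token embeddings.
    geom_feats:   (M, d_model) features from a pretrained 4D backbone.
    Returns (num_tokens, d_model) compact geometry tokens.
    """
    rng = np.random.default_rng(seed)
    # Learned geometry queries (random stand-ins for trained parameters)
    queries = rng.standard_normal((num_tokens, d_model)) * 0.02
    # Condition the queries on a pooled question embedding
    queries = queries + question_emb.mean(axis=0)
    # Scaled dot-product cross-attention over the geometry features
    attn = softmax(queries @ geom_feats.T / np.sqrt(d_model))
    return attn @ geom_feats  # weighted mixture of geometry features
```

Because only `num_tokens` outputs are appended to the VLM input, the module extracts question-relevant geometry without flooding the model with the full 4D reconstruction, matching the targeted-extraction motivation described in the abstract.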