4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language models (VLMs) handle dynamic spatial reasoning from monocular video poorly: approaches that verbalize spatio-temporal reasoning as text are verbose and imprecise, while those that rely on external geometric modules add inference complexity without building intrinsic capability. This work proposes 4DThinker, which the authors describe as the first framework to give VLMs dynamic 4D mental imagery, that is, internal simulation of how a scene evolves within a continuous latent space. The approach has three parts: an annotation-free pipeline that synthesizes 4D reasoning data from raw videos; Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises text tokens and 4D latents; and 4D Reinforcement Learning (4DRL), which uses outcome-based rewards and restricts policy gradients to text tokens for stable optimization. Across multiple dynamic spatial reasoning benchmarks, 4DThinker consistently outperforms strong baselines.
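The joint supervision in DIFT can be pictured as a two-term loss. The sketch below is only an illustration under stated assumptions: the summary does not give the loss form, so the mean-squared error on latents, the weight `latent_weight`, and the function name `dift_joint_loss` are all hypothetical and not taken from the paper.

```python
import math


def dift_joint_loss(text_token_probs, latent_pred, latent_target, latent_weight=1.0):
    """Hypothetical joint objective: text cross-entropy plus latent regression.

    text_token_probs: probability the model assigned to each ground-truth text token.
    latent_pred, latent_target: lists of equal-length float vectors,
        one vector per 4D latent position.
    The MSE term and latent_weight are assumptions made for illustration.
    """
    # Text term: mean negative log-likelihood of the ground-truth tokens.
    text_loss = -sum(math.log(p) for p in text_token_probs) / len(text_token_probs)

    # Latent term (assumed): mean squared error over every latent dimension.
    squared_errors = [
        (p - t) ** 2
        for pred, target in zip(latent_pred, latent_target)
        for p, t in zip(pred, target)
    ]
    latent_loss = sum(squared_errors) / len(squared_errors)

    return text_loss + latent_weight * latent_loss, text_loss, latent_loss
```

With `latent_weight=0` this reduces to ordinary text-only fine-tuning, which is one way to see what the latent term would add.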
📝 Abstract
Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
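Restricting policy gradients to text tokens amounts to masking the latent positions out of the policy loss. The following is a minimal sketch, assuming a plain REINFORCE-style objective with one scalar outcome-based advantage per sequence; the function names, the mask layout, and the choice of REINFORCE are illustrative assumptions, not the 4DRL implementation.

```python
def text_only_policy_loss(token_logprobs, is_text_token, advantage):
    """REINFORCE-style loss averaged over text-token positions only.

    token_logprobs: log-probability of each sampled token in the sequence.
    is_text_token: True for text tokens, False for 4D latent positions.
    advantage: scalar derived from an outcome-based reward (assumed given).
    """
    text_logprobs = [lp for lp, is_text in zip(token_logprobs, is_text_token) if is_text]
    if not text_logprobs:
        return 0.0
    return -advantage * sum(text_logprobs) / len(text_logprobs)


def text_only_policy_grad(is_text_token, advantage):
    """Gradient of the loss above with respect to each token's log-probability.

    Latent positions receive exactly zero, so they are not updated by this loss.
    """
    n_text = sum(1 for is_text in is_text_token if is_text)
    if n_text == 0:
        return [0.0] * len(is_text_token)
    return [(-advantage / n_text) if is_text else 0.0 for is_text in is_text_token]
```

For a four-position sequence whose two middle positions are latents, the gradient is nonzero only at the first and last positions, so the latent positions are left untouched by the reward signal.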
Problem

Research questions and friction points this paper is trying to address.

dynamic spatial reasoning
vision-language models
4D imagery
monocular video
visual intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D reasoning
dynamic spatial understanding
latent mental imagery
vision-language models
reinforcement learning