AI Summary
Existing large multimodal models (LMMs) struggle with continuous 4D spatiotemporal reasoning: they often fail to answer questions about dynamic spatiotemporal relationships accurately or to generate structured object trajectories. To address this limitation, this work introduces a trajectory-grounded multi-turn spatiotemporal dialogue task and presents Track4D-Bench, a new benchmark for training and evaluation. Building on this task, we develop LMM-Track4D, the first model to integrate structured 3D trajectory prediction with multi-turn spatiotemporal dialogue. Our approach combines Ray–Time Geometry Encoding (RTGE), a streaming state token TRK, and an Object-Slot Kinematic Residual-Anchor (OSK-RA) decoder to model dynamic object states explicitly. The method achieves stable 4-step 3D state estimation under challenging conditions such as long-term occlusion and viewpoint changes. It consistently improves over strong baselines on Track4D-Bench, which suggests that explicit dynamic state modeling helps elicit 4D reasoning in LMMs.
Abstract
Recent large multimodal models (LMMs) have become increasingly capable at image and video understanding, yet they still struggle to sustain continuous 4D spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip. We also introduce Track4D-Bench, a benchmark for training and evaluation with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray–Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at https://github.com/mikubaka88/LMM-Track4D.
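The abstract does not describe the internals of the OSK-RA decoder. As a rough illustration of the general idea behind a kinematic anchor plus a residual correction, the sketch below extrapolates a constant-velocity anchor from the last two observed 3D positions and adds a per-step residual to each anchor. The constant-velocity model, the function names `kinematic_anchors` and `decode_states`, and all numeric values are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import List, Sequence

Vec3 = List[float]


def kinematic_anchors(history: Sequence[Sequence[float]], steps: int = 4) -> List[Vec3]:
    """Extrapolate `steps` future 3D positions with a constant-velocity model.

    Assumption: velocity is the difference between the last two observations.
    """
    if len(history) < 2:
        raise ValueError("need at least two observed positions")
    last, prev = history[-1], history[-2]
    velocity = [l - p for l, p in zip(last, prev)]
    return [[l + k * v for l, v in zip(last, velocity)] for k in range(1, steps + 1)]


def decode_states(history: Sequence[Sequence[float]],
                  residuals: Sequence[Sequence[float]]) -> List[Vec3]:
    """Add one residual per step to its kinematic anchor.

    In a learned decoder the residuals would be predicted by the model;
    here they are supplied by hand (hypothetical values).
    """
    anchors = kinematic_anchors(history, steps=len(residuals))
    return [[a + r for a, r in zip(anchor, residual)]
            for anchor, residual in zip(anchors, residuals)]


# Two observed positions (x, y, z) for one object, moving 0.5 units along x per step.
history = [[0.0, 0.0, 5.0], [0.5, 0.0, 5.0]]

# Hypothetical residuals for a 4-step rollout: a small y correction at step 3.
residuals = [[0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0],
             [0.0, 0.1, 0.0],
             [0.0, 0.0, 0.0]]

trajectory = decode_states(history, residuals)
for step, state in enumerate(trajectory, start=1):
    print(step, state)
```

One appeal of this kind of split is that the anchor still yields a physically plausible position when the object is occluded and no visual evidence is available, so the residual only has to account for deviations from simple motion.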