🤖 AI Summary
Multimodal language models show clear weaknesses in spatiotemporal reasoning, and progress is hampered by the scarcity of real-world videos with precise spatial annotations. To address this, we propose SIMS-V, a scalable framework that leverages the privileged information of 3D simulators to generate precisely annotated synthetic video training data. Systematic ablations show that a minimal set of three question categories -- metric measurement, perspective-dependent reasoning, and temporal tracking -- drives the strongest real-world transfer, outperforming comprehensive question coverage. Instruction-tuning a 7B video-language model on just 25K synthetic samples yields a model that surpasses a 72B baseline on real-world spatial reasoning benchmarks and is competitive with proprietary models, while preserving general video understanding. These results provide systematic empirical evidence that small-scale, simulator-grounded synthetic data can deliver both training efficiency and strong generalization for spatial reasoning.
📝 Abstract
Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To address this, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that proves most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
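To make the idea of "privileged information" concrete, the sketch below shows one way simulator ground truth (exact object positions and visibility times) could be turned into metric-measurement and temporal-tracking QA pairs. This is an illustrative assumption, not the paper's released pipeline: the `ObjectTrack` schema, field names, and question templates are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class ObjectTrack:
    """Hypothetical per-object metadata a 3D simulator can export exactly."""
    name: str
    position: tuple          # (x, y, z) world coordinates in meters
    first_visible_frame: int # frame index at which the object first appears

def metric_distance_qa(a: ObjectTrack, b: ObjectTrack) -> dict:
    """Build a metric-measurement QA pair from privileged object positions."""
    dist = math.dist(a.position, b.position)  # exact distance, no annotator noise
    return {
        "question": f"How far apart are the {a.name} and the {b.name}, in meters?",
        "answer": f"{dist:.2f}",
        "category": "metric_measurement",
    }

def appearance_order_qa(objects: list) -> dict:
    """Build a temporal-tracking QA pair from ground-truth visibility times."""
    ordered = sorted(objects, key=lambda o: o.first_visible_frame)
    names = ", ".join(o.name for o in objects)
    return {
        "question": f"Which of these objects appears first in the video: {names}?",
        "answer": ordered[0].name,
        "category": "temporal_tracking",
    }

# Example: two objects exported from a simulated scene
mug = ObjectTrack("mug", (1.2, 0.0, 0.8), first_visible_frame=14)
lamp = ObjectTrack("lamp", (3.0, 0.0, 2.1), first_visible_frame=3)
print(metric_distance_qa(mug, lamp))
print(appearance_order_qa([mug, lamp]))
```

Because the simulator already knows every object's pose and visibility, answers of this kind come for free and are exact, which is precisely what is hard to obtain from real-world footage.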