🤖 AI Summary
Multimodal language models show clear weaknesses in spatiotemporal reasoning, and progress is hampered by the scarcity of real-world videos with precise spatial annotations. To address this, we propose SIMS-V, a scalable framework that leverages the privileged information of 3D simulators to generate precisely annotated synthetic video training data. Systematic ablations show that a minimal set of three question categories -- metric measurement, perspective-dependent reasoning, and temporal tracking -- drives the strongest real-world transfer, outperforming comprehensive question coverage. Instruction-tuning a 7B video-language model on just 25K synthetic samples yields a model that surpasses a 72B baseline on real-world spatial reasoning benchmarks and is competitive with proprietary models, while preserving general video understanding. These results provide systematic empirical evidence that small-scale, simulator-grounded synthetic data can deliver both training efficiency and strong generalization for spatial reasoning.
📝 Abstract
Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To address this, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that proves most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
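To make the idea of "privileged information" concrete, the sketch below shows one way simulator ground truth (exact object positions and visibility times) could be turned into metric-measurement and temporal-tracking QA pairs. This is an illustrative assumption, not the paper's released pipeline: the `ObjectTrack` schema, field names, and question templates are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class ObjectTrack:
    """Hypothetical per-object metadata a 3D simulator can export exactly."""
    name: str
    position: tuple          # (x, y, z) world coordinates in meters
    first_visible_frame: int # frame index at which the object first appears

def metric_distance_qa(a: ObjectTrack, b: ObjectTrack) -> dict:
    """Build a metric-measurement QA pair from privileged object positions."""
    dist = math.dist(a.position, b.position)  # exact distance, no annotator noise
    return {
        "question": f"How far apart are the {a.name} and the {b.name}, in meters?",
        "answer": f"{dist:.2f}",
        "category": "metric_measurement",
    }

def appearance_order_qa(objects: list) -> dict:
    """Build a temporal-tracking QA pair from ground-truth visibility times."""
    ordered = sorted(objects, key=lambda o: o.first_visible_frame)
    names = ", ".join(o.name for o in objects)
    return {
        "question": f"Which of these objects appears first in the video: {names}?",
        "answer": ordered[0].name,
        "category": "temporal_tracking",
    }

# Example: two objects exported from a simulated scene
mug = ObjectTrack("mug", (1.2, 0.0, 0.8), first_visible_frame=14)
lamp = ObjectTrack("lamp", (3.0, 0.0, 2.1), first_visible_frame=3)
print(metric_distance_qa(mug, lamp))
print(appearance_order_qa([mug, lamp]))
```

Because the simulator already knows every object's pose and visibility, answers of this kind come for free and are exact, which is precisely what is hard to obtain from real-world footage.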