🤖 AI Summary
This work addresses the challenge of aligning natural language prompts with spatiotemporal event dynamics in videos. We propose a language-driven 4D dynamic scene understanding framework that, for the first time, embeds natural language into a differentiable, temporally extended 3D Gaussian Splatting representation, enabling end-to-end text-to-dynamic-3D spatiotemporal localization. Our method integrates a CLIP text encoder with a lightweight spatiotemporal feature distillation module to construct a semantically consistent and geometrically accurate 4D Gaussian field. Evaluated on 3D video datasets of people and animals, it reduces spatiotemporal localization error by 37% over baseline methods and supports real-time interactive querying. Key contributions are: (1) the first joint modeling of 4D Gaussian Splatting and language; (2) cross-frame semantic-geometric co-optimization that overcomes the limitations of static or single-frame semantic alignment; and (3) the first differentiable, efficient, and interactive paradigm for language-guided dynamic 3D localization. A toy sketch of the querying step appears below.
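Conceptually, querying such a field reduces to comparing a CLIP text embedding against the distilled per-Gaussian features at every timestep. The sketch below is illustrative only, not the authors' implementation: the `gaussian_features` tensor, its toy sizes, and the assumption that distilled features live directly in CLIP's text embedding space are hypothetical stand-ins.

```python
# Minimal sketch of text-driven spatiotemporal localization against a 4D Gaussian field.
# Assumption: each Gaussian carries a distilled language feature per timestep, already
# aligned to CLIP's text embedding space. All names and sizes are illustrative.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical distilled field: [T timesteps, N Gaussians, D feature dims] (toy sizes).
T, N, D = 16, 5_000, 512
gaussian_features = torch.randn(T, N, D, device=device)  # stand-in for learned features

@torch.no_grad()
def localize(prompt: str, k: int = 100):
    """Return per-Gaussian relevancy over time, the peak timestep, and top-k Gaussians there."""
    tokens = clip.tokenize([prompt]).to(device)
    text_emb = F.normalize(model.encode_text(tokens).float(), dim=-1)   # [1, D]
    feats = F.normalize(gaussian_features, dim=-1)                      # [T, N, D]
    relevancy = (feats @ text_emb.T).squeeze(-1)                        # cosine similarity, [T, N]
    peak_t = relevancy.max(dim=1).values.argmax().item()                # frame where the event peaks
    topk = relevancy[peak_t].topk(k).indices                            # most relevant Gaussians there
    return relevancy, peak_t, topk

relevancy, peak_t, topk = localize("a dog shaking its body")
print(f"event peaks at frame {peak_t}; top Gaussians: {topk[:5].tolist()}")
```

With real distilled features, thresholding or top-k selection over the relevancy map yields the spatiotemporal mask that drives the interactive interface.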
📝 Abstract
The emergence of neural representations has revolutionized the way we digitally view a wide range of 3D scenes, enabling the synthesis of photorealistic images rendered from novel views. Recently, several techniques have been proposed for connecting these low-level representations with the high-level semantic understanding embodied within the scene. These methods elevate rich semantic understanding from 2D imagery to 3D representations by distilling high-dimensional spatial features into 3D space. In our work, we are interested in connecting language with a dynamic modeling of the world. We show how to lift spatiotemporal features to a 4D representation based on 3D Gaussian Splatting. This enables an interactive interface where the user can spatiotemporally localize events in the video from text prompts. We demonstrate our system on public 3D video datasets of people and animals performing various actions.
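As a rough illustration of what lifting 2D features onto a Gaussian representation involves, the toy loop below distills fixed 2D "teacher" features into learnable per-Gaussian features through a frozen blending-weight matrix that stands in for a differentiable rasterizer. This is a minimal sketch under those assumptions, not the paper's pipeline; all sizes and names are made up.

```python
# Toy feature-distillation loop: fit per-Gaussian features so that their splatted
# (here: weight-blended) renderings match 2D teacher features at each pixel.
import torch
import torch.nn.functional as F

P, N, D = 2_048, 5_000, 64                         # toy sizes: pixels, Gaussians, feature dims
weights = (10.0 * torch.randn(P, N)).softmax(-1)   # stand-in splatting weights per pixel
teacher = F.normalize(torch.randn(P, D), dim=-1)   # 2D features lifted from a video frame

gauss_feats = torch.nn.Parameter(0.01 * torch.randn(N, D))  # learnable per-Gaussian features
optim = torch.optim.Adam([gauss_feats], lr=1e-2)

for step in range(200):
    rendered = weights @ gauss_feats               # "render" features to pixels, [P, D]
    loss = 1.0 - F.cosine_similarity(rendered, teacher, dim=-1).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    if step % 50 == 0:
        print(f"step {step}: distillation loss {loss.item():.4f}")
```

In the actual system the blending weights come from a differentiable Gaussian rasterizer evaluated per frame, so the same loss supervises features consistently across both space and time.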