AI Summary
Existing Gaussian splatting methods lack instance-level semantic understanding in dynamic scenes, hindering stable tracking and semantic reasoning. This work proposes a unified spatiotemporal Gaussian representation that, for the first time, embeds instance-consistent semantics into a 4D Gaussian model. By integrating learnable semantic features, a dual-MLP decoder, supervision from multimodal large language models, and 2D optical flow-guided motion optimization, the method jointly models human motion, high-fidelity rendering, and open-vocabulary semantics. It achieves temporally coherent 4D reconstruction, supports instance segmentation and semantic querying, and significantly enhances semantic comprehension and temporal stability in dynamic foreground modeling.
Abstract
Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance and are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and fine-tune their motions, yielding reliable initialization and reducing drift. During training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
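The abstract does not give implementation details, so the following is only a rough sketch of what "learnable semantic features decoded by two MLP decoders" could look like. It is not the authors' code. It assumes one head predicts instance-identity logits and the other predicts a language-aligned embedding, and, as a simplification, it supervises each Gaussian directly instead of rasterizing features to 2D and comparing against image-space masks. All names, sizes, and loss choices (`instance_head`, `language_head`, cross-entropy, cosine distance) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small fully connected network (sizes are assumptions)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply ReLU hidden layers followed by a linear output layer."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)
    w, b = params[-1]
    return x @ w + b

# Hypothetical sizes; none of these come from the paper.
num_gaussians, feat_dim = 1000, 16
num_instances, text_dim = 5, 32

# One semantic feature vector per Gaussian (learnable in a real pipeline).
features = rng.normal(size=(num_gaussians, feat_dim))

# Two decoders sharing the same per-Gaussian features.
instance_head = init_mlp([feat_dim, 64, num_instances])  # instance-identity logits
language_head = init_mlp([feat_dim, 64, text_dim])       # language-aligned embedding

def semantic_losses(features, instance_ids, sentence_emb):
    """Return (instance cross-entropy, mean cosine distance to sentence embeddings)."""
    logits = mlp(instance_head, features)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(instance_ids)), instance_ids].mean()

    pred = mlp(language_head, features)
    target = sentence_emb[instance_ids]  # embedding of each Gaussian's instance
    cos = (pred * target).sum(axis=1) / (
        np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1) + 1e-8)
    return ce, (1.0 - cos).mean()

# Stand-ins for mask-derived instance labels and MLLM sentence embeddings.
instance_ids = rng.integers(0, num_instances, size=num_gaussians)
sentence_emb = rng.normal(size=(num_instances, text_dim))

ce_loss, lang_loss = semantic_losses(features, instance_ids, sentence_emb)
```

In an actual training loop, both losses would be backpropagated into the per-Gaussian features and the two decoders with an autodiff framework. The sketch only evaluates the forward pass.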