AI Summary
Existing methods rely on unstable video segmentation, leading to poor robustness in multi-view 4D scene reconstruction. To address this, we propose Freetime FeatureGS, a segmentation-free framework for decomposed 4D reconstruction. It leverages single-frame image segmentation as weak supervision to guide Gaussian primitives in learning differentiable temporal features, linear motion modeling, and cross-frame contrastive constraints. Our approach integrates dynamic Gaussian splatting rendering, a temporal contrastive loss, and a streaming ordered sampling strategy. To our knowledge, this is the first method enabling instance-level, segmentation-agnostic 4D reconstruction with natural temporal extrapolation. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks: higher reconstruction accuracy, enhanced optimization robustness, and effective mitigation of local minima.
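The linear motion modeling mentioned above can be illustrated with a minimal sketch: each Gaussian primitive carries a base center and a learnable velocity, so its center moves linearly over time and extrapolates naturally beyond the training interval. The class and parameter names below are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

class LinearMotionGaussian:
    """Hypothetical sketch of a Gaussian primitive with linear motion.

    The primitive's center at time t is its base position plus a
    learnable velocity scaled by the elapsed time. Because the motion
    model is linear, querying t outside the training window performs
    temporal extrapolation for free.
    """

    def __init__(self, position, velocity, t0=0.0):
        self.position = np.asarray(position, dtype=float)  # center at reference time t0
        self.velocity = np.asarray(velocity, dtype=float)  # displacement per unit time
        self.t0 = t0

    def center_at(self, t):
        # Linear motion: x(t) = x0 + v * (t - t0)
        return self.position + self.velocity * (t - self.t0)

# A primitive drifting along x at 0.1 units per frame
g = LinearMotionGaussian([0.0, 0.0, 0.0], [0.1, 0.0, 0.0])
```

In a full system the velocity would be optimized jointly with the Gaussian's appearance and feature parameters through the differentiable renderer.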
Abstract
This paper addresses the problem of decomposed 4D scene reconstruction from multi-view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Consequently, they rely heavily on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per-image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, pulling primitive features together or pushing them apart depending on whether their projections belong to the same instance in the 2D segmentation map. Because our Gaussian primitives can move across time, feature learning naturally extends to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms that of recent methods by a large margin.
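The contrastive objective described above can be sketched in a few lines: given per-primitive features and the instance ID that each primitive's projection falls into in a 2D segmentation map, same-instance pairs are pulled together and different-instance pairs are pushed apart up to a margin. This is a generic pull-push contrastive loss for illustration; the function name, the margin hinge form, and the pairwise averaging are assumptions, not the paper's exact formulation.

```python
import numpy as np

def instance_contrastive_loss(features, instance_ids, margin=1.0):
    """Illustrative pairwise contrastive loss over primitive features.

    features     : (N, D) array of per-primitive feature vectors
    instance_ids : length-N list of instance labels from a 2D segmentation map
    margin       : minimum desired distance between different-instance features

    Same-instance pairs contribute their squared distance (pull term);
    different-instance pairs contribute a squared hinge on the margin
    (push term). Both terms are averaged over their pair counts.
    """
    features = np.asarray(features, dtype=float)
    n = len(features)
    pull, push = 0.0, 0.0
    n_pull, n_push = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(features[i] - features[j])
            if instance_ids[i] == instance_ids[j]:
                pull += d ** 2          # pull same-instance features together
                n_pull += 1
            else:
                push += max(0.0, margin - d) ** 2  # push different instances apart
                n_push += 1
    return pull / max(n_pull, 1) + push / max(n_push, 1)

# Two primitives on one instance, one on another, already well separated:
loss = instance_contrastive_loss(
    [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]], [1, 1, 2], margin=1.0
)
```

In practice this loss would be computed on sampled primitives whose projections land inside the current frame's segmentation map, with gradients flowing back through the differentiable renderer; a streaming, temporally ordered sampling of frames then propagates the learned features forward in time.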