🤖 AI Summary
This work addresses dynamic scene decomposition and long-term identity tracking in multi-view videos, both of which are hampered by instance labels that are inconsistent across independently segmented views. The proposed framework matches instances across videos using per-video latent label-permutation variables coupled with a differentiable Sinkhorn layer, and uses instance-decomposed motion scaffolds to optimize long-horizon 4D Gaussian trajectories. The result is instance-level 4D Gaussian Splatting reconstruction with temporally stable identities and no identity drift. On the Panoptic Studio dataset, the method reaches a PSNR of 28.36 (+2.26) and an instance mIoU of 0.9129 (+0.2819) relative to the strongest baseline.
📝 Abstract
We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
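To make the label-alignment idea concrete, the following is a minimal NumPy sketch of Sinkhorn normalization applied to a per-video matrix of label-matching logits. It is an illustration only, not the paper's implementation: the function names (`sinkhorn`, `align_labels`), the temperature `tau`, the iteration count, and the toy logits are all assumptions, and the abstract does not specify how the latents are parameterized or trained. In a real pipeline the logits would be learnable parameters inside an autodiff framework so that gradients flow through the normalization.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis, keeping dims.
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))


def sinkhorn(logits, n_iters=50, tau=0.1):
    """Map a square logit matrix to an approximately doubly stochastic matrix.

    Rows index a video's local instance labels, columns index global
    instance identities (an assumed convention for this sketch).
    """
    log_p = np.asarray(logits, dtype=np.float64) / tau
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # normalize rows
        log_p = log_p - _logsumexp(log_p, axis=0)  # normalize columns
    return np.exp(log_p)


def align_labels(local_labels, soft_perm):
    # Hard assignment by row-wise argmax. This is only a valid permutation
    # when the soft matrix is close to one, as in the toy example below.
    mapping = soft_perm.argmax(axis=1)
    return mapping[np.asarray(local_labels)]


# Toy logits (hypothetical): local label 0 matches global 2, 1 -> 0, 2 -> 1.
logits = np.array([[0.0, 0.0, 5.0],
                   [5.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0]])
P = sinkhorn(logits)
global_labels = align_labels([0, 1, 2, 2, 0], P)
```

Lowering `tau` pushes the soft matrix toward a hard permutation, which is one plausible way such a relaxation can yield crisp identity assignments while remaining differentiable. When the soft matrix is ambiguous, a row-wise argmax can assign two local labels to the same identity, so a proper assignment solver would be the safer choice for the final hard matching.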