Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 3D methods for egocentric videos struggle to accurately segment dynamic objects due to their reliance on static scene assumptions. To address this, we propose a 3D motion segmentation framework that synergistically integrates 2D motion segmentation with layered radiance fields. Our key contributions are: (1) the first hierarchical motion fusion mechanism, which explicitly embeds 2D motion segmentation outputs into the 3D representation via layered radiance field decomposition; and (2) test-time frame-level geometric refinement, jointly optimizing motion and geometric consistency across frames. Evaluated on dynamic egocentric benchmarks—including EPIC-Kitchens—our method significantly outperforms 2D segmentation baselines and achieves state-of-the-art performance. This work constitutes the first empirical validation that explicit 3D modeling substantially enhances motion understanding in complex, dynamic first-person vision tasks, demonstrating both feasibility and superiority over purely 2D approaches.

Technology Category

Application Category

📝 Abstract
Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (Layered Motion Fusion). However, the high complexity of long, dynamic videos makes it challenging to capture the underlying geometric structure, and, as a result, hinders the fusion of motion cues into the (incomplete) scene geometry. We address this issue through test-time refinement, which helps the model to focus on specific frames, thereby reducing the data complexity. This results in a synergy between motion fusion and the refinement, and in turn leads to segmentation predictions of the 3D model that surpass the 2D baseline by a large margin. This demonstrates that 3D techniques can enhance 2D analysis even for dynamic phenomena in a challenging and realistic setting.
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D dynamic object segmentation in egocentric videos
Fusing 2D motion cues into layered 3D radiance fields
Overcoming scene complexity via test-time refinement for accurate 3D segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fusing 2D motion segmentation into layered radiance fields
Using test-time refinement to reduce data complexity
Enhancing 3D segmentation via motion fusion synergy
🔎 Similar Papers
No similar papers found.