Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos

📅 2024-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dense 4D reconstruction (the joint estimation of camera intrinsics, extrinsics, and per-frame depth) from egocentric videos remains challenging due to the absence of high-quality ground-truth annotations. Method: We propose EgoMono4D, the first self-supervised monocular 4D reconstruction framework for egocentric video. It builds on pre-trained single-frame depth and intrinsics models, incorporates a differentiable, learnable pose estimation module, and jointly optimizes all parameters end-to-end via multi-frame geometric consistency constraints and a self-supervised photometric loss. Contribution/Results: EgoMono4D is the first to systematically apply self-supervised learning to egocentric 4D reconstruction under extreme label scarcity, enabling zero-shot cross-domain generalization. It significantly outperforms existing methods both in-domain and in zero-shot settings, producing dense temporal point cloud sequences. The code, pretrained models, and interactive visualizations are publicly released.
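To make the self-supervised photometric loss in the summary concrete, here is a minimal numpy sketch of the idea: backproject a target frame's depth map into 3D with the estimated intrinsics, transform the points by the estimated relative pose, reproject them into the source frame, and compare pixel intensities. This is an illustrative sketch, not the paper's released code; the function names (`backproject`, `photometric_loss`) and the nearest-neighbor sampling are assumptions for brevity (real implementations typically use differentiable bilinear sampling).

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (h, w) to a 3 x N point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                      # unit-depth rays
    return rays * depth.reshape(1, -1)                                 # scale rays by depth

def photometric_loss(img_t, img_s, depth_t, K, T_ts):
    """Mean absolute photometric error between the target frame and the
    source frame warped into it via depth, intrinsics K, and relative
    pose T_ts (4x4, mapping target-frame points into the source frame)."""
    h, w = depth_t.shape
    pts = backproject(depth_t, K)                           # 3 x N, target camera frame
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])    # homogeneous coordinates
    pts_s = (T_ts @ pts_h)[:3]                              # points in source frame
    proj = K @ pts_s                                        # project into source image
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)           # perspective divide
    u = np.round(uv[0]).astype(int).reshape(h, w)           # nearest-neighbor sampling
    v = np.round(uv[1]).astype(int).reshape(h, w)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)         # mask out-of-bounds pixels
    warped = np.zeros_like(img_t)
    warped[valid] = img_s[v[valid], u[valid]]
    return np.abs(img_t - warped)[valid].mean()
```

With an identity relative pose and identical frames, the loss is zero; during training, gradients of this error (through a differentiable sampler) drive the depth, intrinsics, and pose estimates toward multi-frame consistency without any labels.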

📝 Abstract
Egocentric videos provide valuable insights into human interactions with the physical world, which has sparked growing interest in the computer vision and robotics communities. A critical challenge in fully understanding the geometry and dynamics of egocentric videos is dense scene reconstruction. However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we address this issue by exploring a self-supervised dynamic scene reconstruction approach. We introduce EgoMono4D, a novel model that unifies the estimation of multiple variables necessary for egocentric monocular 4D reconstruction, including camera intrinsics, camera poses, and video depth, all within a fast feed-forward framework. Starting from a pretrained single-frame depth and intrinsics estimation model, we extend it with camera pose estimation and align multi-frame results on large-scale unlabeled egocentric videos. We evaluate EgoMono4D in both in-domain and zero-shot generalization settings, achieving superior performance in dense point cloud sequence reconstruction compared to all baselines. EgoMono4D represents the first attempt to apply self-supervised learning for point cloud sequence reconstruction to the label-scarce egocentric field, enabling fast, dense, and generalizable reconstruction. The interactive visualization, code, and trained models are released at https://egomono4d.github.io/
Problem

Research questions and friction points this paper is trying to address.

Self-supervised dense scene reconstruction for egocentric videos
Unified estimation of camera intrinsics, poses, and depth
Generalizable 4D reconstruction without labeled datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised dynamic scene reconstruction approach
Unifies camera intrinsics, pose, and depth estimation
Fast feed-forward framework for 4D reconstruction
Chengbo Yuan
Institute for Interdisciplinary Information Science (IIIS), Tsinghua University
Embodied AI · Computer Vision · Robot Learning · Agent
Geng Chen
Shanghai Artificial Intelligence Laboratory, Shanghai Qi Zhi Institute
Li Yi
Institute for Interdisciplinary Information Sciences, Tsinghua University, Shanghai Artificial Intelligence Laboratory, Shanghai Qi Zhi Institute
Yang Gao
Institute for Interdisciplinary Information Sciences, Tsinghua University, Shanghai Artificial Intelligence Laboratory, Shanghai Qi Zhi Institute