🤖 AI Summary
Dense 4D reconstruction—joint estimation of camera intrinsics, extrinsics, and per-frame depth—remains challenging for egocentric videos, where high-quality ground-truth annotations are scarce.
Method: We propose EgoMono4D, the first self-supervised monocular 4D reconstruction framework for egocentric video. It builds upon pre-trained single-frame depth and intrinsic models, incorporates a differentiable, learnable pose estimation module, and jointly optimizes all parameters end-to-end via multi-frame geometric consistency constraints and a self-supervised photometric loss.
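The self-supervised photometric objective described above can be illustrated with a minimal sketch (the function and variable names here are hypothetical, not the authors' implementation): given a source frame's per-pixel depth, the intrinsics K, and a relative pose (R, t), each source pixel is lifted to 3D, reprojected into a neighboring frame, and compared photometrically. A real implementation would use differentiable bilinear sampling; nearest-neighbor lookup is used here to keep the sketch short.

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) with depth d to the 3D point d * K^-1 [u, v, 1]^T."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    return np.linalg.inv(K) @ pix * depth.reshape(1, -1)               # (3, H*W)

def photometric_loss(img_src, img_tgt, depth_src, K, R, t):
    """Reproject source pixels into the target view and compare intensities (L1).

    img_src, img_tgt : (H, W) grayscale frames
    depth_src        : (H, W) per-pixel depth of the source frame
    K                : (3, 3) camera intrinsics
    R, t             : relative pose from source to target camera frame
    """
    H, W = depth_src.shape
    pts = backproject(depth_src, K)        # 3D points in the source camera frame
    pts_tgt = R @ pts + t.reshape(3, 1)    # transform into the target camera frame
    proj = K @ pts_tgt                     # perspective projection
    z = np.clip(proj[2], 1e-6, None)
    u = np.round(proj[0] / z).astype(int)  # nearest-neighbor target coordinates
    v = np.round(proj[1] / z).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    diff = np.abs(img_src.reshape(-1)[valid] - img_tgt[v[valid], u[valid]])
    return diff.mean()
```

With an identity pose and identical frames the loss is zero by construction; training drives depth, pose, and intrinsics jointly so that this reprojection error is minimized across frames, without any ground-truth labels.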
Contribution/Results: EgoMono4D is the first to systematically apply self-supervised learning to egocentric 4D reconstruction under extreme label scarcity, enabling zero-shot cross-domain generalization. It significantly outperforms existing methods in both in-domain and zero-shot settings, producing dense temporal point cloud sequences. The code, pretrained models, and interactive visualizations are publicly released.
📝 Abstract
Egocentric videos provide valuable insights into human interactions with the physical world, which has sparked growing interest in the computer vision and robotics communities. A critical challenge in fully understanding the geometry and dynamics of egocentric videos is dense scene reconstruction. However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. We introduce EgoMono4D, a novel model that unifies the estimation of multiple variables necessary for Egocentric Monocular 4D reconstruction, including camera intrinsics, camera poses, and video depth, all within a fast feed-forward framework. Starting from a pretrained single-frame depth and intrinsic estimation model, we extend it with camera pose estimation and align multi-frame results on large-scale unlabeled egocentric videos. We evaluate EgoMono4D in both in-domain and zero-shot generalization settings, achieving superior performance in dense point cloud sequence reconstruction compared to all baselines. EgoMono4D represents the first attempt to apply self-supervised learning for point cloud sequence reconstruction to the label-scarce egocentric field, enabling fast, dense, and generalizable reconstruction. The interactive visualization, code, and trained models are released at https://egomono4d.github.io/