🤖 AI Summary
Existing methods for reconstructing the geometry, appearance, and physical properties of non-rigid objects from monocular videos rely on time-consuming per-scene optimization or manual annotations, limiting their practicality and generalization. This work proposes the first self-supervised feedforward framework that jointly estimates physical parameters and performs 3D Gaussian Splatting reconstruction via a dual-branch network. Using only a single monocular video, the method simultaneously recovers geometry, appearance, and physical attributes without requiring ground-truth physics labels. By integrating differentiable rendering with self-supervised training, it achieves state-of-the-art performance on a large-scale synthetic dataset: future-frame prediction PSNR reaches 21.64 (versus 13.27 for the prior state of the art), Chamfer Distance drops to 0.004 (from 0.349), and inference takes under one second rather than the hours required by optimization-based approaches, dramatically improving both efficiency and generalization.
📝 Abstract
Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (<1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.
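To make the dual-branch design concrete, below is a minimal NumPy sketch of what such a feedforward pass could look like. This is not the authors' code: the point count, feature dimension, parameterization (14 Gaussian parameters; two physical attributes), and all function names are illustrative assumptions. It only shows the structure — a shared encoder feeding two heads, one for 3D Gaussian parameters and one for physical attributes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch of a dual-branch feedforward pass (illustrative, not the
# paper's architecture): a shared video encoder yields one feature vector per
# point; one branch maps features to 3D Gaussian parameters, the other to
# per-point physical attributes.

N_POINTS = 1024   # number of Gaussians / material points (assumed)
FEAT_DIM = 64     # shared feature dimension (assumed)

def shared_encoder(video_frames):
    """Stand-in for a learned video encoder: one feature vector per point."""
    return rng.standard_normal((N_POINTS, FEAT_DIM))

def gaussian_branch(feats):
    """Features -> 3DGS parameters: 3 pos + 3 scale + 4 quat + 1 opacity + 3 color = 14."""
    W = rng.standard_normal((FEAT_DIM, 14)) * 0.01  # stand-in for learned weights
    return feats @ W

def physics_branch(feats):
    """Features -> physical attributes (e.g. stiffness, density), kept positive."""
    W = rng.standard_normal((FEAT_DIM, 2)) * 0.01   # stand-in for learned weights
    raw = feats @ W
    return np.log1p(np.exp(raw))  # softplus ensures positive physical values

frames = None                         # placeholder for a monocular video clip
feats = shared_encoder(frames)        # (1024, 64) shared representation
gaussians = gaussian_branch(feats)    # (1024, 14) renderable Gaussian parameters
materials = physics_branch(feats)     # (1024, 2) positive physical attributes
```

In the self-supervised setting described above, the Gaussian parameters would be passed to a differentiable renderer and the physical attributes to a differentiable simulator, so that photometric error on future frames supervises both branches without ground-truth physics labels.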