Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos

📅 2025-06-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Reconstructing articulated objects from casually captured RGB-D video remains challenging due to poor generalization and robustness in unstructured real-world settings. To address this, we propose the first end-to-end coarse-to-fine framework tailored for real-world scenarios—requiring no object priors, precise calibration, or specialized acquisition protocols. Our method jointly optimizes kinematic parameters while enforcing temporal geometric consistency and performing part-level dynamic segmentation, thereby achieving robust decoupling of structure and motion. Trained on a large-scale synthetic dataset (784 videos, 284 objects) curated by us, it achieves state-of-the-art performance on both synthetic and real-world benchmarks. It accurately recovers kinematic structures across 11 diverse articulated object categories, significantly improving reconstruction accuracy, robustness to noise and occlusion, and cross-category generalization. This work advances articulated object reconstruction toward practical deployment and scalable application.

Technology Category

Application Category

📝 Abstract
Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20$ imes$ larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing articulated objects from casual RGBD videos
Overcoming challenges of simultaneous object-camera movement
Generalizing across object categories with coarse-to-fine framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine framework for joint parameter inference
Dynamic RGBD video segmentation of movable parts
Large synthetic dataset for training and evaluation
🔎 Similar Papers
No similar papers found.