🤖 AI Summary
This work addresses the problem of unsupervised, end-to-end reconstruction of articulated-object digital twins from static multi-view RGB images alone. The proposed method requires only two sets of such images, captured in two different articulation states, and needs no video sequences, 3D annotations, motion priors, semantic labels, or explicit segmentation supervision; it jointly estimates part-level geometry, appearance, and articulation parameters. Built upon 3D Gaussian Splatting (3D-GS), it introduces a multi-stage optimization framework with part-decoupled representations and self-supervised motion consistency constraints that disentangle the highly entangled shape, appearance, and joint motion parameters. Experimental results demonstrate state-of-the-art part segmentation accuracy, motion estimation accuracy, and novel-view rendering quality. This work establishes a new paradigm for fully unsupervised digital twin modeling of articulated objects.
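For intuition, the sketch below shows how the Gaussian centers of a movable part can be rigidly transformed about a revolute joint, which is the kind of part-level articulation the summary describes. The function names, the NumPy implementation, and the axis/pivot/angle parameterization are illustrative assumptions rather than the paper's actual code; a full implementation would also rotate each Gaussian's orientation and covariance.

```python
import numpy as np

def axis_angle_to_rotation(axis, angle):
    """Rodrigues' formula: rotation matrix from a unit axis and an angle (radians)."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def articulate_part(means, axis, pivot, angle):
    """Rigidly rotate the movable part's Gaussian centers about a revolute joint.

    means : (N, 3) Gaussian centers assigned to the movable part
    axis  : (3,)   joint axis direction
    pivot : (3,)   a point on the joint axis
    angle : float  articulation angle between the two observed states
    """
    R = axis_angle_to_rotation(axis, angle)
    return (means - pivot) @ R.T + pivot

# Toy usage: swing a "door" part by 30 degrees about a vertical hinge.
part_means = np.random.rand(100, 3)
moved = articulate_part(part_means,
                        axis=np.array([0.0, 0.0, 1.0]),
                        pivot=np.array([0.5, 0.0, 0.0]),
                        angle=np.deg2rad(30))
```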
📝 Abstract
We tackle the challenge of jointly reconstructing part-level geometry with RGB appearance and estimating motion parameters to build digital twins of articulated objects, using the 3D Gaussian Splatting (3D-GS) method. Given two distinct sets of multi-view images, each depicting the object in a different static articulation configuration, we reconstruct the articulated object as a 3D Gaussian representation that captures appearance and geometry simultaneously. Our approach decouples multiple highly interdependent parameters through a multi-step optimization process, yielding a stable optimization procedure and high-quality results. We introduce ArticulatedGS, a self-supervised, comprehensive framework that autonomously learns to model shapes and appearances at the part level and synchronizes the optimization of motion parameters, all without relying on 3D supervision, motion cues, or semantic labels. Our experiments demonstrate that, among comparable methods, our approach achieves the best results in part segmentation accuracy, motion estimation accuracy, and visual quality.
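As a rough illustration of the multi-step decoupling idea, the toy example below alternates between optimizing a single joint angle and per-point soft part assignments on synthetic point sets. The staged schedule, the loss, and all names are assumptions made for illustration; they stand in for the photometric and motion consistency objectives the method optimizes over 3D Gaussians.

```python
import torch

torch.manual_seed(0)

def rot_z(theta):
    """3x3 rotation about the z-axis, differentiable w.r.t. theta."""
    c, s = torch.cos(theta), torch.sin(theta)
    z, o = torch.zeros_like(theta), torch.ones_like(theta)
    return torch.stack([torch.stack([c, -s, z]),
                        torch.stack([s,  c, z]),
                        torch.stack([z,  z, o])])

# Toy scene: a static base and a movable lid; the lid rotates by an unknown
# angle between the two observed articulation states.
base = torch.rand(200, 3)
lid = torch.rand(200, 3) + torch.tensor([1.0, 0.0, 0.0])
true_angle = torch.tensor(0.6)
state_a = torch.cat([base, lid])
state_b = torch.cat([base, lid @ rot_z(true_angle).T])

# Highly interdependent parameters: one joint angle and a per-point logit that
# softly assigns each point to the static or the movable part.
angle = torch.zeros((), requires_grad=True)
logits = torch.zeros(state_a.shape[0], requires_grad=True)

def consistency_loss():
    """Movable points, rotated by the current angle, should match state B;
    static points should already match it. A stand-in for photometric losses."""
    p_move = torch.sigmoid(logits).unsqueeze(-1)
    moved = state_a @ rot_z(angle).T
    predicted = p_move * moved + (1.0 - p_move) * state_a
    return ((predicted - state_b) ** 2).sum(-1).mean()

def run_stage(params, steps, lr):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = consistency_loss()
        loss.backward()
        opt.step()

# Multi-step schedule: coarse motion first, then part assignment, then joint refinement.
run_stage([angle], steps=300, lr=1e-2)           # stage 1: motion parameters only
run_stage([logits], steps=300, lr=5e-2)          # stage 2: part assignment only
run_stage([angle, logits], steps=300, lr=1e-2)   # stage 3: refine everything together
print(f"estimated angle: {angle.item():.3f} (ground truth {true_angle.item():.3f})")
```

The staging mirrors the abstract's claim that decoupling interdependent variables into separate optimization steps stabilizes convergence compared with optimizing everything at once.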