🤖 AI Summary
Existing articulated-object recognition methods often rely on priors about the number of parts, on depth-image inputs, or on complex intermediate representations, which limits their generalizability and practical applicability. This paper introduces ScrewSplat, an end-to-end framework for estimating the kinematic structure of articulated objects from RGB observations alone. The approach randomly initializes screw axes and optimizes them together with a Gaussian Splatting-based 3D reconstruction, requiring neither geometric priors nor depth input, so that motion axes, rigid-part segmentation, and scene geometry are recovered jointly. Evaluated on standard benchmarks, the method achieves state-of-the-art recognition accuracy and further supports zero-shot, text-guided manipulation, improving robustness and generalizability for articulated-object understanding in real-world settings.
📝 Abstract
Articulated object recognition -- the task of identifying both the geometry and kinematic joints of objects with movable parts -- is essential for enabling robots to interact with everyday objects such as doors and laptops. However, existing approaches often rely on strong assumptions, such as a known number of articulated parts; require additional inputs, such as depth images; or involve complex intermediate steps that can introduce potential errors -- limiting their practicality in real-world settings. In this paper, we introduce ScrewSplat, a simple end-to-end method that operates solely on RGB observations. Our approach begins by randomly initializing screw axes, which are then iteratively optimized to recover the object's underlying kinematic structure. By integrating with Gaussian Splatting, we simultaneously reconstruct the 3D geometry and segment the object into rigid, movable parts. We demonstrate that our method achieves state-of-the-art recognition accuracy across a diverse set of articulated objects, and further enables zero-shot, text-guided manipulation using the recovered kinematic model.
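To make the screw-axis representation concrete, here is a minimal sketch of a screw transform in the style the abstract describes. The function names and the revolute-joint example are our own illustration, not the paper's code; ScrewSplat's actual pipeline optimizes these axis parameters jointly with Gaussian Splatting rather than assuming them known.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ p == np.cross(w, p)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def screw_transform(omega, q, pitch, theta):
    """Homogeneous transform for a screw motion: rotate by `theta` about the
    axis with unit direction `omega` passing through point `q`, while sliding
    pitch*theta along the axis (pitch=0 gives a pure revolute joint, e.g. a
    laptop hinge; a pure prismatic joint, e.g. a drawer, is the limit with
    no rotation)."""
    W = skew(omega)
    # Rodrigues' formula for the rotation part.
    R = np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)
    # Translation: account for the axis being off the origin, then slide.
    t = (np.eye(3) - R) @ q + pitch * theta * omega
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: move a point on a movable part by 90 degrees about a hinge
# (the z-axis through the origin, zero pitch -> revolute joint).
T = screw_transform(np.array([0.0, 0.0, 1.0]), np.zeros(3), 0.0, np.pi / 2)
p = T @ np.array([1.0, 0.0, 0.0, 1.0])  # point (1, 0, 0) in homogeneous form
```

In this parameterization each joint is a handful of continuous values (axis direction, a point on the axis, pitch, and a joint angle), which is what makes random initialization followed by gradient-based optimization, as in the abstract, workable.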