🤖 AI Summary
This work addresses the challenge of estimating the kinematic parameters of unknown articulated objects from a single human demonstration, without prior object knowledge and in the presence of occlusions or constraints on the order of operation. The authors propose PokeNet, an end-to-end point cloud sequence learning framework that requires no predefined object categories or joint types. By integrating temporal modeling with geometric reasoning, PokeNet simultaneously estimates joint axes, tracks articulation states, and infers the order of manipulation actions. It is presented as the first method to recover a complete kinematic model of a multi-degree-of-freedom articulated object from a single demonstration, and it significantly outperforms existing approaches in both simulated and real-world settings, improving joint-axis and joint-state estimation accuracy by an average of over 27%.
📝 Abstract
Articulation modeling enables robots to learn the joint parameters of articulated objects for effective manipulation; these parameters can then be used downstream for skill learning or planning. Existing approaches often rely on prior knowledge about the objects, such as the number or type of joints. Some also fail to recover occluded joints that are revealed only during interaction, and others require large numbers of multi-view images for every object, which is impractical in real-world settings. Furthermore, prior work neglects the order of manipulation, which is essential for many multi-DoF objects in which one joint must be operated before another, such as a dishwasher. We introduce PokeNet, an end-to-end framework that estimates articulation models from a single human demonstration without prior object knowledge. Given a sequence of point cloud observations of a human manipulating an unknown object, PokeNet predicts joint parameters, infers the manipulation order, and tracks joint states over time. PokeNet outperforms existing state-of-the-art methods, improving joint-axis and joint-state estimation accuracy by an average of over 27% across diverse objects, including novel and unseen categories. We demonstrate these gains in both simulation and real-world environments.
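To make the described input/output contract concrete, here is a minimal PyTorch sketch of a model with the interface the abstract implies: a sequence of point clouds in; joint axes, per-frame joint states, and manipulation-order scores out. This is not the authors' released code: the class name `PokeNetSketch`, the PointNet-style per-frame encoder, the GRU temporal aggregator, and all head dimensions are illustrative assumptions.

```python
# Hypothetical sketch of the interface described in the abstract -- NOT the
# authors' architecture. Encoder, GRU, and head shapes are all assumptions.
import torch
import torch.nn as nn


class PokeNetSketch(nn.Module):
    """Maps a point cloud sequence to joint axes, per-frame joint states,
    joint-type logits, and manipulation-order scores (assumed interface)."""

    def __init__(self, max_joints: int = 4, feat_dim: int = 256):
        super().__init__()
        self.max_joints = max_joints
        # Shared per-frame point encoder (PointNet-style: per-point MLP + max pool).
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Temporal aggregation across the demonstration frames.
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Per-joint heads: axis direction (3) + point on axis (3),
        # joint-type logits (revolute/prismatic), and an ordering score.
        self.axis_head = nn.Linear(feat_dim, max_joints * 6)
        self.type_head = nn.Linear(feat_dim, max_joints * 2)
        self.order_head = nn.Linear(feat_dim, max_joints)
        # Per-frame joint state (angle or displacement) for each joint.
        self.state_head = nn.Linear(feat_dim, max_joints)

    def forward(self, clouds: torch.Tensor) -> dict:
        # clouds: (B, T, N, 3) -- T point cloud frames with N points each.
        B, T, N, _ = clouds.shape
        per_point = self.point_mlp(clouds)           # (B, T, N, F)
        per_frame = per_point.max(dim=2).values      # (B, T, F) symmetric pool
        seq_feats, _ = self.temporal(per_frame)      # (B, T, F)
        summary = seq_feats[:, -1]                   # whole-demo summary
        axes = self.axis_head(summary).view(B, self.max_joints, 6)
        return {
            "axis_direction": nn.functional.normalize(axes[..., :3], dim=-1),
            "axis_point": axes[..., 3:],                                   # (B, J, 3)
            "joint_type_logits": self.type_head(summary).view(B, self.max_joints, 2),
            "manipulation_order": self.order_head(summary),                # (B, J)
            "joint_states": self.state_head(seq_feats),                    # (B, T, J)
        }


if __name__ == "__main__":
    model = PokeNetSketch()
    demo = torch.randn(1, 10, 1024, 3)  # one demo: 10 frames, 1024 points each
    out = model(demo)
    print({k: tuple(v.shape) for k, v in out.items()})
```

Under these assumptions, ranking the `manipulation_order` scores yields the inferred operation sequence, and `joint_states` gives the tracked articulation state at every frame of the demonstration.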