🤖 AI Summary
Manipulating deformable rope-like objects with dual-arm robots is difficult because of high-dimensional deformations and self-occlusion, both of which hinder learning generalizable policies from limited teleoperation data. This work builds a physically consistent 3D particle state of the rope: the state is extracted from an initial observation via multi-view fusion and then evolved in a particle-based simulation grounded in eXtended Position-Based Dynamics (XPBD). The study compares two Action Chunking with Transformers (ACT) policies trained on the same bimanual teleoperation data for a knot-untangling task, one conditioned on this structured state and one conditioned on raw egocentric RGB streams from wrist-mounted cameras. Evaluated open-loop on an unseen rope configuration, the state-based policy reduces the L1 error in predicting the initial grasp-and-pull action by 30.8% relative to the vision-based policy. The result suggests that the observation space, rather than the policy architecture or data scale, limits generalization, and that structured state representations improve observability and data efficiency.
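To make the headline metric concrete, the sketch below shows one way a relative L1 error reduction between two policies could be computed. The function names, the mean-over-all-entries aggregation, and the toy numbers are illustrative assumptions; the source does not specify how the L1 error is aggregated over action dimensions or chunk steps.

```python
import numpy as np

def l1_error(pred, target):
    """Mean absolute error over all action entries (assumed aggregation)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))

def relative_reduction(baseline_err, new_err):
    """Percentage by which new_err is lower than baseline_err."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# Toy values only, not data from the paper.
target = np.zeros(2)
vision_pred = np.array([0.2, -0.2])   # hypothetical vision-policy prediction
state_pred = np.array([0.1, -0.1])    # hypothetical state-policy prediction

vision_err = l1_error(vision_pred, target)
state_err = l1_error(state_pred, target)
reduction = relative_reduction(vision_err, state_err)
```

Under this reading, a 30.8% reduction means the state-based policy's L1 error is about 0.692 times that of the vision-based policy.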
📝 Abstract
Deformable Linear Objects (DLOs) such as ropes and cables are widely encountered in both household and industrial applications, yet remain challenging to manipulate due to their infinite-dimensional configuration space and frequent self-occlusion. Imitation learning from teleoperation offers a practical path to bimanual DLO manipulation, but its scalability is limited by human effort, making the choice of observation space critical for generalization from small datasets. In this study, we investigate whether the lack of generalization in egocentric visual policies for the knot-untangling task stems from the observation space itself, rather than from the policy architecture or data scale. We compare two Action Chunking with Transformers (ACT) policies trained on the same bimanual teleoperation data: a vision-based policy conditioned on two egocentric RGB streams from wrist-mounted cameras, and a state-based policy conditioned on the DLO's 3D particle state, extracted from an initial observation via multi-view fusion and evolved in a particle-based eXtended Position-Based Dynamics (XPBD) simulation. Evaluated open-loop on an unseen rope configuration, the state-based policy outperforms its visual counterpart with a 30.8% reduction in L1 error when predicting the initial grasp-and-pull action. This result quantifies the observability gap between pixels and physics-consistent state, and points toward more data-efficient robot learning for DLO manipulation from limited human demonstrations.
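As a rough illustration of what evolving a rope's particle state with XPBD can look like, the sketch below advances a chain of particles by one time step using only distance constraints between neighbours. It is a minimal sketch under stated assumptions, not the authors' simulator: the function name `xpbd_rope_step`, the parameter values, and the omission of bending constraints, collisions, and gripper attachments are all choices made for this example.

```python
import numpy as np

def xpbd_rope_step(x, v, inv_mass, rest_len, dt=1e-2, compliance=1e-6,
                   iters=20, gravity=(0.0, 0.0, -9.81)):
    """One XPBD step for a particle chain with neighbour distance constraints.

    x, v: (N, 3) positions and velocities; inv_mass: (N,), 0 pins a particle.
    All defaults are illustrative assumptions.
    """
    g = np.asarray(gravity, dtype=float)
    free = (inv_mass > 0)[:, None]          # pinned particles do not move
    v = v + dt * g * free                   # apply external acceleration
    p = x + dt * v * free                   # predicted positions
    lam = np.zeros(len(x) - 1)              # one Lagrange multiplier per segment
    alpha_tilde = compliance / dt**2        # time-step-scaled compliance

    for _ in range(iters):
        for i in range(len(x) - 1):
            d = p[i] - p[i + 1]
            dist = np.linalg.norm(d)
            if dist < 1e-12:
                continue                    # skip degenerate segment
            n = d / dist                    # constraint gradient w.r.t. p[i]
            c = dist - rest_len             # constraint violation
            w = inv_mass[i] + inv_mass[i + 1]
            dlam = (-c - alpha_tilde * lam[i]) / (w + alpha_tilde)
            lam[i] += dlam
            p[i] += inv_mass[i] * dlam * n
            p[i + 1] -= inv_mass[i + 1] * dlam * n

    v_new = (p - x) / dt                    # velocities from position change
    return p, v_new

def simulate_hanging_rope(num_particles=6, rest_len=0.1, steps=50, iters=30):
    """Release a horizontal rope pinned at its first particle."""
    x = np.zeros((num_particles, 3))
    x[:, 0] = rest_len * np.arange(num_particles)
    v = np.zeros_like(x)
    inv_mass = np.ones(num_particles)
    inv_mass[0] = 0.0                       # pin the first particle
    for _ in range(steps):
        x, v = xpbd_rope_step(x, v, inv_mass, rest_len, iters=iters)
    return x
```

Calling `simulate_hanging_rope()` returns the particle positions after the rope has swung down from its pinned end. In a pipeline like the one described above, an array of this shape, initialized from the multi-view observation rather than a straight line, would serve as the state input to the policy.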