🤖 AI Summary
Dual-arm robots suffer from poor operational robustness in highly randomized environments, inaccurate segmentation of task-critical objects, low multi-task learning efficiency, and redundant action spaces. Method: This paper proposes a 3D manipulability-guided sparse diffusion policy framework that integrates simulator-supervised semantic segmentation, task-conditioned feature encoding, and manipulability-anchored key-pose diffusion. It jointly predicts joint angles and end-effector poses via geometric consistency, drastically compressing the action space and accelerating convergence. Technically, it incorporates point-cloud augmentation, a lightweight task encoder, and full-state-supervised diffusion modeling to enable end-to-end vision–motor policy learning. Contribution/Results: Evaluated on the RoboTwin benchmark, the method achieves a 98.7% average success rate and demonstrates strong generalization across extreme environmental randomizations—including object appearance, clutter level, table height, illumination, and background variation.
📝 Abstract
We present AnchorDP3, a diffusion policy framework for dual-arm robotic manipulation that achieves state-of-the-art performance in highly randomized environments. AnchorDP3 integrates three key innovations: (1) Simulator-Supervised Semantic Segmentation, using rendered ground truth to explicitly segment task-critical objects within the point cloud, which provides strong affordance priors; (2) Task-Conditioned Feature Encoders, lightweight modules processing augmented point clouds per task, enabling efficient multi-task learning through a shared diffusion-based action expert; (3) Affordance-Anchored Keypose Diffusion with Full State Supervision, replacing dense trajectory prediction with sparse, geometrically meaningful action anchors, i.e., keyposes such as pre-grasp pose, grasp pose directly anchored to affordances, drastically simplifying the prediction space; the action expert is forced to predict both robot joint angles and end-effector poses simultaneously, which exploits geometric consistency to accelerate convergence and boost accuracy. Trained on large-scale, procedurally generated simulation data, AnchorDP3 achieves a 98.7% average success rate in the RoboTwin benchmark across diverse tasks under extreme randomization of objects, clutter, table height, lighting, and backgrounds. This framework, when integrated with the RoboTwin real-to-sim pipeline, has the potential to enable fully autonomous generation of deployable visuomotor policies from only scene and instruction, totally eliminating human demonstrations from learning manipulation skills.