AnchorDP3: 3D Affordance Guided Sparse Diffusion Policy for Robotic Manipulation

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dual-arm robots suffer from poor robustness in highly randomized environments, inaccurate segmentation of task-critical objects, low multi-task learning efficiency, and redundant action spaces. Method: This paper proposes a 3D affordance-guided sparse diffusion policy framework that integrates simulator-supervised semantic segmentation, task-conditioned feature encoding, and affordance-anchored keypose diffusion. It jointly predicts joint angles and end-effector poses, exploiting their geometric consistency to drastically compress the action space and accelerate convergence. Technically, it incorporates point-cloud augmentation, a lightweight task encoder, and full-state-supervised diffusion modeling to enable end-to-end visuomotor policy learning. Contribution/Results: Evaluated on the RoboTwin benchmark, the method achieves a 98.7% average success rate and generalizes strongly under extreme environmental randomization, including object appearance, clutter level, table height, illumination, and background variation.

📝 Abstract
We present AnchorDP3, a diffusion policy framework for dual-arm robotic manipulation that achieves state-of-the-art performance in highly randomized environments. AnchorDP3 integrates three key innovations: (1) Simulator-Supervised Semantic Segmentation, using rendered ground truth to explicitly segment task-critical objects within the point cloud, which provides strong affordance priors; (2) Task-Conditioned Feature Encoders, lightweight modules processing augmented point clouds per task, enabling efficient multi-task learning through a shared diffusion-based action expert; (3) Affordance-Anchored Keypose Diffusion with Full State Supervision, replacing dense trajectory prediction with sparse, geometrically meaningful action anchors, i.e., keyposes such as pre-grasp and grasp poses anchored directly to affordances, drastically simplifying the prediction space; the action expert is forced to predict both robot joint angles and end-effector poses simultaneously, which exploits geometric consistency to accelerate convergence and boost accuracy. Trained on large-scale, procedurally generated simulation data, AnchorDP3 achieves a 98.7% average success rate on the RoboTwin benchmark across diverse tasks under extreme randomization of objects, clutter, table height, lighting, and backgrounds. This framework, when integrated with the RoboTwin real-to-sim pipeline, has the potential to enable fully autonomous generation of deployable visuomotor policies from only a scene and an instruction, entirely eliminating human demonstrations from manipulation skill learning.
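The full-state supervision in innovation (3) can be sketched as a combined loss: the action expert predicts both joint angles and an end-effector pose, and a forward-kinematics term penalizes disagreement between the two predictions. This is a minimal illustrative sketch, not the paper's implementation; the toy 2-DOF arm, the loss weights, and the function names are all placeholder assumptions.

```python
import numpy as np

def forward_kinematics(joint_angles):
    """Placeholder FK for a toy 2-DOF planar arm with unit links.
    A real implementation would use the robot's actual kinematic chain.
    Returns a pose (x, y, orientation)."""
    q1, q2 = joint_angles
    x = np.cos(q1) + np.cos(q1 + q2)
    y = np.sin(q1) + np.sin(q1 + q2)
    return np.array([x, y, q1 + q2])

def full_state_loss(pred_joints, pred_ee_pose, gt_joints, gt_ee_pose,
                    w_consistency=0.1):
    """Supervise joint angles and end-effector pose jointly, plus a
    geometric-consistency term tying the two predictions together via FK."""
    joint_loss = np.mean((pred_joints - gt_joints) ** 2)
    pose_loss = np.mean((pred_ee_pose - gt_ee_pose) ** 2)
    consistency = np.mean((forward_kinematics(pred_joints) - pred_ee_pose) ** 2)
    return joint_loss + pose_loss + w_consistency * consistency
```

In this reading, the consistency term is what lets the two output heads share information: an FK-inconsistent pair of predictions is penalized even when each head is individually close to its target.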
Problem

Research questions and friction points this paper is trying to address.

Enhances robotic manipulation in randomized environments
Integrates affordance priors for task-critical object segmentation
Simplifies action prediction via sparse keypose diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulator-Supervised Semantic Segmentation for affordance priors
Task-Conditioned Feature Encoders for multi-task learning
Affordance-Anchored Keypose Diffusion simplifies prediction space
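The keypose-anchoring idea above replaces dense trajectories with a few poses tied to object affordances. A minimal sketch of that reduction, assuming a grasp affordance given as a point and an approach direction (the offset value and function name are hypothetical, not from the paper):

```python
import numpy as np

def keyposes_from_affordance(grasp_point, approach_dir, pre_grasp_offset=0.10):
    """Derive sparse keyposes anchored to a grasp affordance: the grasp
    pose sits at the affordance point, and the pre-grasp pose is offset
    back along the (normalized) approach direction."""
    approach = np.asarray(approach_dir, dtype=float)
    approach /= np.linalg.norm(approach)
    grasp_pose = np.asarray(grasp_point, dtype=float)
    pre_grasp_pose = grasp_pose - pre_grasp_offset * approach
    return {"pre_grasp": pre_grasp_pose, "grasp": grasp_pose}
```

A policy that predicts only these anchors, rather than every waypoint in between, has a far smaller output space, which is the source of the claimed convergence speedup.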
Ziyan Zhao
Jingdong Technology Information Technology Co., Ltd
Ke Fan
Fudan University
He-Yang Xu
Southeast University
Ning Qiao
Jingdong Technology Information Technology Co., Ltd
Bo Peng
Jingdong Technology Information Technology Co., Ltd
Wenlong Gao
Jingdong Technology Information Technology Co., Ltd
Dongjiang Li
Jingdong Technology Information Technology Co., Ltd
Hui Shen
Jingdong Technology Information Technology Co., Ltd