RoboPCA: Pose-centered Affordance Learning from Human Demonstrations for Robot Manipulation

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inconsistency between predicted contact regions and manipulation poses in existing spatial affordance prediction methods, which often model these components separately and thereby risk task failure. To resolve this, we propose RoboPCA, a novel framework that, for the first time, centers affordance prediction around manipulation poses by jointly estimating task-relevant contact regions and poses. We introduce the Human2Afford data pipeline to automatically extract 3D scene information from human demonstrations and generate pose-centric affordance annotations. Our architecture employs an RGB-D encoder to fuse geometric and appearance features, enhanced with a mask-aware mechanism to focus on task-relevant objects, and leverages a diffusion model for joint pose-region prediction. Experiments demonstrate that our approach outperforms baselines across image, simulation, and real-robot platforms, exhibiting strong generalization across tasks and object categories.
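The summary above describes two coupled pieces: mask-aware fusion of RGB-D features and a diffusion model that jointly predicts the contact region and the manipulation pose. A minimal, purely illustrative sketch of that data flow follows; the function names, feature shapes, and denoising dynamics are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def fuse_rgbd_features(rgb_feat, depth_feat, mask):
    """Fuse appearance (RGB) and geometry (depth) feature maps, then
    re-weight by an object mask so task-relevant regions dominate.
    Shapes: rgb_feat (H, W, Cr), depth_feat (H, W, Cd), mask (H, W)."""
    fused = np.concatenate([rgb_feat, depth_feat], axis=-1)  # (H, W, Cr+Cd)
    # Mask-enhanced weighting: object pixels at full weight, background halved.
    weights = 0.5 + 0.5 * mask[..., None]
    return fused * weights

def denoise_pose_region(cond, steps=10, rng=None):
    """Toy reverse-diffusion loop that jointly refines a 2D contact-region
    center and a 3D end-effector orientation (Euler angles) from noise,
    conditioned on pooled scene features. The update rule is a placeholder
    for a learned denoiser, not the paper's trained model."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(5)            # [u, v, roll, pitch, yaw]
    target = np.tanh(cond[:5])            # stand-in for the learned score
    for t in range(steps, 0, -1):
        noise_scale = t / steps
        x = x + 0.5 * (target - x) + 0.05 * noise_scale * rng.standard_normal(5)
    return x[:2], x[2:]                   # contact center, orientation
```

Because region and pose come out of the same sample, the two predictions are consistent by construction, which is the inconsistency the summary says separate pipelines risk.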

📝 Abstract
Understanding spatial affordances -- comprising the contact regions of object interaction and the corresponding contact poses -- is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object's mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand-object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry-appearance cues through an RGB-D encoder and incorporating mask-enhanced features into the diffusion-based framework to emphasize task-relevant object regions, RoboPCA outperforms baseline methods on image datasets, in simulation, and on real robots, and exhibits strong generalization across tasks and categories.
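The abstract's mapping from a 3D hand mesh to a robot end-effector orientation can be pictured as building a gripper frame from hand-derived directions. A hedged sketch, assuming a palm approach direction and a finger-closing direction have already been extracted from the hand mesh; the function and its inputs are illustrative, not the paper's actual procedure.

```python
import numpy as np

def hand_to_gripper_rotation(approach_dir, closing_dir):
    """Build a 3x3 end-effector rotation matrix from two hand-derived
    directions: the palm's approach direction (mapped to the gripper z axis)
    and the finger-closing direction (orthogonalized into the gripper y
    axis). Columns of the result are the gripper x, y, z axes."""
    z = approach_dir / np.linalg.norm(approach_dir)   # gripper approach axis
    y = closing_dir - np.dot(closing_dir, z) * z      # Gram-Schmidt: remove z component
    y = y / np.linalg.norm(y)                         # gripper closing axis
    x = np.cross(y, z)                                # right-handed third axis
    return np.stack([x, y, z], axis=1)
```

The Gram-Schmidt step guarantees an orthonormal, right-handed frame even when the raw hand directions are not exactly perpendicular, which is the usual situation with noisy hand-mesh estimates.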
Problem

Research questions and friction points this paper is trying to address.

spatial affordance, contact region, pose estimation, robot manipulation, human demonstration
Innovation

Methods, ideas, or system contributions that make the work stand out.

pose-centered affordance, human demonstration, RoboPCA, Human2Afford, diffusion-based manipulation