🤖 AI Summary
This work addresses the challenge of generalizing robotic manipulation policies across diverse environments, objects, and morphologies. We propose the first fully autonomous, offline learning framework that requires no teleoperation, using only human demonstration videos as input. Methodologically, we integrate vision foundation models (SAM/ViT), 3D hand pose estimation, and differentiable keypoint detection to construct a unified semantic keypoint representation that decouples visual perception from action generation and enables morphology-agnostic modeling of human hand–object interaction; an end-to-end Transformer serves as the policy network. Experiments on eight real-world tasks demonstrate a 75% absolute improvement in success rate, a 74% gain in zero-shot generalization to unseen objects, and robust performance under heavy background clutter. Our core contributions are (1) the first purely video-driven paradigm for learning generalizable manipulation policies, and (2) a unified keypoint representation that bridges perception and control.
📝 Abstract
Building robotic agents capable of operating across diverse environments and object types remains a significant challenge, often requiring extensive data collection. This is particularly restrictive in robotics, where each data point must be physically executed in the real world. Consequently, there is a critical need for alternative data sources for robotics and frameworks that enable learning from such data. In this work, we present Point Policy, a new method for learning robot policies exclusively from offline human demonstration videos and without any teleoperation data. Point Policy leverages state-of-the-art vision models and policy architectures to translate human hand poses into robot poses while capturing object states through semantically meaningful key points. This approach yields a morphology-agnostic representation that facilitates effective policy learning. Our experiments on 8 real-world tasks demonstrate an overall 75% absolute improvement over prior works when evaluated in settings identical to training. Further, Point Policy exhibits a 74% gain across tasks for novel object instances and is robust to significant background clutter. Videos of the robot are best viewed at https://point-policy.github.io/.
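To make the morphology-agnostic representation concrete, the sketch below shows one way a human hand pose and object key points could be folded into a single state vector for a policy network. This is a hypothetical illustration only: the function names, the generic 21-point hand layout (0 = wrist, 4 = thumb tip, 8 = index tip), and the thumb-index distance as a gripper proxy are all assumptions, not the paper's actual pipeline.

```python
import numpy as np

def hand_to_robot_pose(hand_keypoints: np.ndarray) -> np.ndarray:
    """Map 3D human-hand keypoints to a robot end-effector proxy.

    Hypothetical mapping: the wrist point gives the end-effector
    position, and the thumb-index fingertip distance stands in for
    gripper openness. hand_keypoints has shape (21, 3).
    """
    wrist = hand_keypoints[0]                                  # (x, y, z)
    grip = np.linalg.norm(hand_keypoints[4] - hand_keypoints[8])
    return np.concatenate([wrist, [grip]])                     # shape (4,)

def unified_state(hand_keypoints: np.ndarray,
                  object_keypoints: np.ndarray) -> np.ndarray:
    """Concatenate the robot-pose proxy with flattened object key
    points into one morphology-agnostic state vector; the same vector
    could describe a human demonstration or a robot rollout."""
    pose = hand_to_robot_pose(hand_keypoints)
    return np.concatenate([pose, object_keypoints.reshape(-1)])

# Example: 21 hand keypoints and 5 object keypoints in 3D
state = unified_state(np.zeros((21, 3)), np.zeros((5, 3)))
print(state.shape)  # (19,): 4 pose dims + 5 * 3 object dims
```

Because the state contains only poses and key points, never raw pixels or joint angles, the same policy input format applies whether the demonstrator is a human hand or a robot gripper, which is the core of the morphology-agnostic claim.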