🤖 AI Summary
Robotic manipulation policies often fail to generalize zero-shot because they must jointly learn where to attend, what actions to take, and how to execute them. This paper introduces PEEK (Policy-agnostic Extraction of Essential Keypoints), a framework that fine-tunes vision-language models (VLMs) to predict task-critical end-effector point paths and task-relevant spatial masks, offloading the "where" and "what" to the VLM so that policies can specialize in "how." These point-based annotations are overlaid directly onto robot observations, yielding a lightweight, transferable intermediate representation that is architecture-agnostic and provides plug-and-play guidance without modifying the underlying policy design. An automatic annotation pipeline generates training labels across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, a 3D policy trained purely in simulation achieves a 41.4× improvement when deployed on real robots, and both large VLM-based action models (VLAs) and compact manipulation policies attain 2–3.5× gains in zero-shot generalization across unseen tasks and environments.
📝 Abstract
Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4x real-world improvement for a 3D policy trained only in simulation, and 2-3.5x gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how. Website at https://peek-robot.github.io/.
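The core idea of the representation is simple: the VLM's predicted end-effector path and task-relevance mask are drawn directly onto the pixels the policy already consumes. The following is a minimal, hypothetical sketch of that overlay step (the function name, colors, and dimming factor are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def overlay_peek_annotations(obs, path_points, mask, path_color=(255, 0, 0)):
    """Overlay a predicted end-effector path ("what") and a task-relevance
    mask ("where") onto an RGB observation of shape (H, W, 3), dtype uint8.
    Hypothetical illustration of a point-based intermediate representation.
    """
    out = obs.copy()
    # Dim pixels outside the task-relevant mask, directing attention ("where").
    out[~mask] = (out[~mask] * 0.3).astype(np.uint8)
    # Draw each predicted path point ("what") as a small colored square.
    h, w = obs.shape[:2]
    for (x, y) in path_points:
        x0, x1 = max(0, x - 1), min(w, x + 2)
        y0, y1 = max(0, y - 1), min(h, y + 2)
        out[y0:y1, x0:x1] = path_color
    return out

# Toy usage: 64x64 gray observation, a short diagonal path, mask on the left half.
obs = np.full((64, 64, 3), 128, dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[:, :32] = True
path = [(10, 10), (15, 15), (20, 20)]
overlaid = overlay_peek_annotations(obs, path, mask)
```

Because the annotated image keeps the policy's original input format, any downstream architecture (2D, 3D, or VLA) can consume it unchanged, which is what makes the representation policy-agnostic.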