PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robotic manipulation policies often fail to generalize zero-shot because they must jointly learn where to attend, what actions to take, and how to execute them. This paper introduces PEEK, a framework that fine-tunes vision-language models (VLMs) to predict task-critical end-effector point trajectories and task-relevant spatial masks, offloading the "where" and "what" to the VLM so the policy can specialize in the "how." The predicted annotations are overlaid directly onto robot observations, yielding a lightweight, policy-agnostic intermediate representation that transfers across architectures. An automatic annotation pipeline generates training labels across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, a 3D policy trained purely in simulation achieves a 41.4x improvement when deployed on real robots, and both large vision-language-action models and compact manipulation policies gain 2-3.5x in zero-shot generalization on unseen tasks and environments.

📝 Abstract
Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4x real-world improvement for a 3D policy trained only in simulation, and 2-3.5x gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how. Website at https://peek-robot.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Robotic manipulation policies fail to generalize across different tasks
Policies must simultaneously learn where to attend, what actions to take, and how to execute them
Existing approaches struggle with zero-shot generalization across robot embodiments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes VLMs to predict unified point-based representations
Generates end-effector paths and task-relevant masks automatically
Overlays annotations on robot observations for policy-agnostic transfer
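The overlay step above can be illustrated with a minimal sketch. Note this is an assumed, simplified rendering (the function name `overlay_peek_annotations`, the dot-drawing, and the mask-dimming scheme are illustrative, not the paper's actual implementation): the VLM-predicted end-effector path is drawn as points on the image, and pixels outside the task-relevant mask are dimmed, so any downstream policy sees the "where" and "what" cues directly in its observation.

```python
import numpy as np

def overlay_peek_annotations(obs, path_points, mask, dim=0.25):
    """Hypothetical sketch of PEEK-style annotation overlay.

    obs:         (H, W, 3) uint8 RGB observation
    path_points: list of (row, col) pixel coordinates for the
                 predicted end-effector path ("what actions to take")
    mask:        (H, W) boolean array of task-relevant regions
                 ("where to focus")
    """
    out = obs.astype(np.float32)
    # Dim everything outside the task-relevant mask.
    out[~mask] *= dim
    out = out.astype(np.uint8)
    # Draw each path point as a small bright-red square.
    for r, c in path_points:
        out[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2] = (255, 0, 0)
    return out

# Toy example: 8x8 gray image, task-relevant top-left corner,
# one predicted path point at (2, 2).
obs = np.full((8, 8, 3), 200, dtype=np.uint8)
mask = np.zeros((8, 8), dtype=bool)
mask[0:4, 0:4] = True
annotated = overlay_peek_annotations(obs, [(2, 2)], mask)
```

Because the cues live in pixel space, this same annotated image can be fed to any policy architecture unchanged, which is what makes the representation policy-agnostic.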