🤖 AI Summary
Robotic manipulation policies often fail to generalize zero-shot because they must jointly learn where to attend, what actions to take, and how to execute them. This paper introduces PEEK (Policy-agnostic Extraction of Essential Keypoints), a framework that fine-tunes vision-language models (VLMs) to predict task-critical end-effector point paths and task-relevant spatial masks, offloading the "where" and "what" to the VLM so that policies can specialize in "how." These point-based annotations are overlaid directly onto robot observations, yielding a lightweight, transferable intermediate representation that is architecture-agnostic and provides plug-and-play guidance without modifying the underlying policy design. An automatic annotation pipeline generates training labels across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, a 3D policy trained purely in simulation achieves a 41.4× improvement when deployed on real robots, and both large VLM-based action models (VLAs) and compact manipulation policies attain 2–3.5× gains in zero-shot generalization across unseen tasks and environments.
📝 Abstract
Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: (1) end-effector paths specifying what actions to take, and (2) task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4x real-world improvement for a 3D policy trained only in simulation, and 2-3.5x gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need: where, what, and how. Website at https://peek-robot.github.io/.
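The core idea of the representation is simple: the VLM's predicted end-effector path and task-relevance mask are drawn directly onto the pixels the policy already consumes. The following is a minimal, hypothetical sketch of that overlay step (the function name, colors, and dimming factor are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def overlay_peek_annotations(obs, path_points, mask, path_color=(255, 0, 0)):
    """Overlay a predicted end-effector path ("what") and a task-relevance
    mask ("where") onto an RGB observation of shape (H, W, 3), dtype uint8.
    Hypothetical illustration of a point-based intermediate representation.
    """
    out = obs.copy()
    # Dim pixels outside the task-relevant mask, directing attention ("where").
    out[~mask] = (out[~mask] * 0.3).astype(np.uint8)
    # Draw each predicted path point ("what") as a small colored square.
    h, w = obs.shape[:2]
    for (x, y) in path_points:
        x0, x1 = max(0, x - 1), min(w, x + 2)
        y0, y1 = max(0, y - 1), min(h, y + 2)
        out[y0:y1, x0:x1] = path_color
    return out

# Toy usage: 64x64 gray observation, a short diagonal path, mask on the left half.
obs = np.full((64, 64, 3), 128, dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[:, :32] = True
path = [(10, 10), (15, 15), (20, 20)]
overlaid = overlay_peek_annotations(obs, path, mask)
```

Because the annotated image keeps the policy's original input format, any downstream architecture (2D, 3D, or VLA) can consume it unchanged, which is what makes the representation policy-agnostic.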