GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses how visuomotor policies overfit and degrade under extreme data scarcity and environmental perturbations. The authors propose Geometric Anchor Pre-training (GAP), which pre-trains a lightweight spatial adapter on an action-free proxy task in simulation while keeping the vision foundation model frozen. The pre-trained adapter produces stable, temporally repeatable keypoints that lie on the object and cover its spatial extent, giving downstream few-shot imitation learning a reliable geometric interface without fine-tuning the visual backbone. On RoboMimic and ManiSkill, with only 15 to 50 demonstrations, GAP outperforms attention-based poolers and end-to-end fine-tuning: for example, 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA) and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning).

📝 Abstract

Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.
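
The abstract names four properties the proxy task encourages in the adapter's keypoints: on the object, spread over its extent, sharp, and repeatable over time. The NumPy sketch below shows one way such objectives could be written for a spatial-softmax pooling layer. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss formulas, and the function names, the specific loss terms, and the use of a spatial softmax are all hypothetical choices made here.

```python
import numpy as np

def spatial_softmax_keypoints(logits, temperature=1.0):
    """Turn (K, H, W) heatmap logits into per-keypoint probability maps
    and expected (x, y) coordinates in [-1, 1]. Hypothetical pooling layer."""
    K, H, W = logits.shape
    flat = logits.reshape(K, -1) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(K, H, W)
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    kp_x = (probs * xs).sum(axis=(1, 2))
    kp_y = (probs * ys).sum(axis=(1, 2))
    return probs, np.stack([kp_x, kp_y], axis=1)

def proxy_losses(probs, kps, prev_kps, mask, eps=1e-8):
    """Illustrative (assumed) loss terms; lower is better for each.
    probs: (K, H, W), kps and prev_kps: (K, 2), mask: (H, W) in {0, 1}. Needs K >= 2."""
    # On-object: probability mass that falls outside the object mask.
    on_object = 1.0 - (probs * mask).sum(axis=(1, 2)).mean()
    # Coverage: negative mean pairwise distance, so spread-out keypoints score lower.
    i, j = np.triu_indices(kps.shape[0], k=1)
    coverage = -np.sqrt(((kps[i] - kps[j]) ** 2).sum(axis=1) + eps).mean()
    # Sharpness: mean entropy of each keypoint's probability map.
    sharpness = -(probs * np.log(probs + eps)).sum(axis=(1, 2)).mean()
    # Repeatability: squared keypoint displacement between consecutive frames
    # (a simplification that ignores object motion).
    temporal = ((kps - prev_kps) ** 2).sum(axis=1).mean()
    return {"on_object": on_object, "coverage": coverage,
            "sharpness": sharpness, "temporal": temporal}
```

In a setup like this, a weighted sum of the four terms would be minimized with respect to the adapter's parameters only, with the VFM features held fixed. The weights, like the terms themselves, are assumptions to be tuned rather than values from the paper.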

Problem

Research questions and friction points this paper is trying to address.

visuomotor learning

data efficiency

geometric grounding

robustness

few-shot imitation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric Anchor Pre-training

data-efficient visuomotor learning

frozen Vision Foundation Models