Consistent Zero-Shot Imitation with Contrastive Goal Inference

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot task adaptation for embodied agents by proposing a self-supervised, interactive pretraining framework in which goals (observations) serve as the fundamental learning unit. Methodologically, it integrates exploratory reinforcement learning, contrastive goal representation learning, and amortized inverse reinforcement learning—enabling agents to autonomously discover goals, practice reaching them, and, at test time, infer expert intent via contrastive goal inference—without any human-annotated data. The core contribution is the first demonstration of zero-shot, task-agnostic, consistent imitation learning, eliminating the need for task-specific demonstrations or labels during pretraining. Evaluated on standard benchmarks, the approach substantially outperforms prior methods: it achieves state-of-the-art performance on goal-directed tasks and also generalizes well to tasks without an explicit goal structure, underscoring its cross-task adaptability.

📝 Abstract
In the same way that generative models today conduct most of their training in a self-supervised fashion, how can agentic models conduct their training in a self-supervised fashion, interactively exploring, learning, and preparing to quickly adapt to new tasks? A prerequisite for embodied agents deployed in real world interactions ought to be training with interaction, yet today's most successful AI models (e.g., VLMs, LLMs) are trained without an explicit notion of action. The problem of pure exploration (which assumes no data as input) is well studied in the reinforcement learning literature and provides agents with a wide array of experiences, yet it fails to prepare them for rapid adaptation to new tasks. Today's language and vision models are trained on data provided by humans, which provides a strong inductive bias for the sorts of tasks that the model will have to solve (e.g., modeling chords in a song, phrases in a sonnet, sentences in a medical record). However, when they are prompted to solve a new task, there is a faulty tacit assumption that humans spend most of their time in the most rewarding states. The key contribution of our paper is a method for pre-training interactive agents in a self-supervised fashion, so that they can instantly mimic human demonstrations. Our method treats goals (i.e., observations) as the atomic construct. During training, our method automatically proposes goals and practices reaching them, building off prior work in reinforcement learning exploration. During evaluation, our method solves an (amortized) inverse reinforcement learning problem to explain demonstrations as optimal goal-reaching behavior. Experiments on standard benchmarks (not designed for goal-reaching) show that our approach outperforms prior methods for zero-shot imitation.
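At test time, the method explains a demonstration as optimal goal-reaching behavior by inferring which goal best accounts for the observed states. A minimal sketch of that contrastive inference step, assuming simple dot-product compatibilities in a shared latent space; the `encode_obs` and `encode_goal` encoders here are hypothetical stand-ins for the paper's learned contrastive representations:

```python
import numpy as np

def infer_goal(demo_obs, candidate_goals, encode_obs, encode_goal):
    """Score each candidate goal against a demonstration by summing
    contrastive compatibilities (dot products in a shared latent space)
    over the demonstration's observations; return the best goal index."""
    phi = np.array([encode_obs(o) for o in demo_obs])          # (T, d)
    psi = np.array([encode_goal(g) for g in candidate_goals])  # (K, d)
    scores = phi @ psi.T                # (T, K) pairwise compatibilities
    return int(np.argmax(scores.sum(axis=0)))

# Toy check with identity encoders: the demonstration's observations
# cluster near the second candidate goal, so it should win.
demo = [np.array([0.9, 0.1]), np.array([1.1, -0.1])]
goals = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
best = infer_goal(demo, goals, lambda x: x, lambda x: x)
```

In the paper's setting the encoders would come from contrastive pretraining, and the score would play the role of the (amortized) inverse-RL objective; the argmax here is just the simplest way to read out the inferred intent.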
Problem

Research questions and friction points this paper is trying to address.

Training interactive agents through self-supervised exploration without human data
Enabling agents to rapidly adapt and mimic human demonstrations instantly
Solving inverse reinforcement learning to explain demonstrations as goal-reaching behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pretraining for interactive agents
Automated goal proposal and practice mechanism
Inverse reinforcement learning for demonstration interpretation
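The pretraining phase alternates between proposing goals and practicing how to reach them. A toy 1-D sketch of that loop, where `propose_goal` (frontier sampling) and the proportional update inside `practice` are illustrative stand-ins for the paper's exploration mechanism and learned goal-conditioned policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_goal(visited):
    # Hypothetical frontier-style proposal: sample a goal just beyond
    # the furthest state reached so far.
    return max(visited) + rng.uniform(0.0, 1.0)

def practice(state, goal, steps=20, lr=0.5):
    # Placeholder goal-reaching policy: move a fraction of the way
    # toward the goal each step (stands in for the trained policy).
    for _ in range(steps):
        state += lr * (goal - state)
    return state

# Self-supervised loop: propose a goal, practice reaching it, record
# the resulting state as newly visited experience.
visited = [0.0]
state = 0.0
for _ in range(5):
    goal = propose_goal(visited)
    state = practice(state, goal)
    visited.append(state)
```

Each iteration pushes the frontier of visited states outward, mirroring how the pretraining phase gathers increasingly diverse goal-reaching experience without any human-provided tasks.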
Kathryn Wantlin
Department of Computer Science, Princeton University
Chongyi Zheng
Princeton University
Reinforcement Learning · Machine Learning
Ben Eysenbach
Department of Computer Science, Princeton University