🤖 AI Summary
This work addresses temporal representation alignment for robot instruction following, enabling zero-shot generalization to multi-step composite tasks by composing learned primitive-task representations, without explicit subtask decomposition or reinforcement learning. We propose an end-to-end representation learning framework that models task semantics via successor features and introduces a contrastive temporal alignment loss, which pulls current-state representations toward future-goal-state representations within a shared cross-modal (language/image) embedding space. Crucially, we find that this temporal alignment loss alone induces emergent compositional task reasoning, removing the need for modular architectures or RL-based optimization. Across diverse real-world and simulated robotic manipulation tasks, our method improves zero-shot success rates on composite instructions by an average of 37%, for both language- and image-specified goals.
📝 Abstract
Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps together. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.
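The temporal alignment idea above can be sketched as a contrastive objective: embeddings of current states are pulled toward embeddings of future (goal) states from the same trajectory, while other batch entries act as negatives. The sketch below assumes an InfoNCE-style formulation; the function name, shapes, and temperature are illustrative, not the paper's actual API.

```python
import numpy as np

def temporal_alignment_loss(z_current, z_future, temperature=0.1):
    """Illustrative InfoNCE-style contrastive temporal alignment loss.

    z_current, z_future: (batch, dim) embeddings, where row i of z_future
    is a future/goal state from the same trajectory as row i of z_current.
    Matched (i, i) pairs are positives; all other rows are negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    z_c = z_current / np.linalg.norm(z_current, axis=1, keepdims=True)
    z_f = z_future / np.linalg.norm(z_future, axis=1, keepdims=True)
    logits = z_c @ z_f.T / temperature  # (batch, batch) similarity matrix
    # Softmax cross-entropy with diagonal (matched-pair) targets
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Under this formulation, minimizing the loss makes a state's representation predictive of the representations of states that follow it, which is the property the abstract credits with enabling compositional generalization.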