RoboSSM: Scalable In-context Imitation Learning via State-Space Models

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and poor long-context extrapolation of Transformers in in-context imitation learning, this paper introduces state-space models (SSMs) to few-shot robotic task learning for the first time, proposing an efficient and scalable framework built on the Longhorn architecture. The method enables linear-time inference over long sequences and integrates a context-prompting mechanism to model action sequences while supporting cross-task generalization. Evaluated on the LIBERO benchmark, it significantly outperforms Transformer-based baselines, particularly under low-shot demonstration settings, unseen tasks, and long-horizon scenarios, demonstrating superior robustness and generalization. Key contributions include: (1) establishing the first SSM-based paradigm for in-context imitation learning; (2) overcoming the long-context bottleneck inherent to Transformers; and (3) providing a viable pathway for resource-constrained robot learning.

📝 Abstract
In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.
Problem

Research questions and friction points this paper is trying to address.

Transformers incur high computational cost on long prompts
Current ICIL methods underperform on prompts longer than those seen during training
In-context imitation learning needs an efficient, scalable backbone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces the Transformer backbone with state-space models
Uses the Longhorn SSM for linear-time inference
Handles long-context prompts robustly, extrapolating beyond training lengths
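The linear-time claim behind these bullets comes from the recurrent form of SSMs: each step updates a fixed-size state once, so processing a prompt of length T costs O(T), versus the O(T²) pairwise comparisons of self-attention. Below is a minimal sketch of a generic diagonal SSM scan; this is not the actual Longhorn update rule, and the function name, shapes, and parameters are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Generic diagonal state-space model scan (illustrative, not Longhorn).

    x: (T, d_in)        input sequence, e.g. a tokenized demonstration prompt
    a: (d_state,)       per-channel decay; |a| < 1 keeps the recurrence stable
    b: (d_state, d_in)  input projection
    c: (d_out, d_state) output readout
    Returns y: (T, d_out). The state h is updated once per step, so the
    whole scan is O(T) in sequence length.
    """
    h = np.zeros(a.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = a * h + b @ x[t]   # recurrent state update (fixed-size state)
        ys.append(c @ h)       # linear readout from the state
    return np.stack(ys)

# Tiny usage example with hypothetical dimensions.
rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 3
y = ssm_scan(rng.normal(size=(T, d_in)),
             np.full(d_state, 0.9),
             rng.normal(size=(d_state, d_in)),
             rng.normal(size=(d_out, d_state)))
print(y.shape)  # (16, 3)
```

Because the state carries all history in a fixed-size vector, inference cost per step does not grow with prompt length, which is what makes long in-context prompts of many demonstrations tractable.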
Authors

Youngju Yoo, The University of Texas at Austin & KAIST
Jiaheng Hu, UT-Austin (Robot Learning, Reinforcement Learning, Robotics, Mobile Manipulation)
Yifeng Zhu, The University of Texas at Austin
Bo Liu, The University of Texas at Austin & FAIR at Meta
Qiang Liu, The University of Texas at Austin
Roberto Martín-Martín, The University of Texas at Austin (Robotics, Artificial Perception, Machine Learning, Interactive Perception, Probabilistic Reasoning)
Peter Stone, The University of Texas at Austin & Sony AI