RoboSSM: Scalable In-context Imitation Learning via State-Space Models

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost and poor long-context extrapolation of Transformers in in-context imitation learning, this paper introduces state-space models (SSMs) to few-shot robotic task learning for the first time, proposing an efficient and scalable framework built on the Longhorn architecture. The method enables linear-time inference over long sequences and integrates a context-prompting mechanism to model action sequences while supporting cross-task generalization. Evaluated on the LIBERO benchmark, it significantly outperforms Transformer-based baselines, particularly under low-shot demonstration settings, unseen tasks, and long-horizon scenarios, demonstrating superior robustness and generalization. Key contributions include: (1) establishing the first SSM-based paradigm for in-context imitation learning; (2) overcoming the long-context bottleneck inherent to Transformers; and (3) providing a viable pathway for resource-constrained robot learning.

📝 Abstract
In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.
Problem

Research questions and friction points this paper is trying to address.

Transformers incur high computational cost on long prompts
Current ICIL methods underperform on prompts longer than those seen during training
In-context imitation learning needs an efficient, scalable backbone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces the Transformer backbone with state-space models
Uses the Longhorn SSM for linear-time inference
Handles long-context prompts robustly, extrapolating beyond training lengths
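The linear-time claim behind these bullets comes from the recurrent form of SSMs: each step updates a fixed-size state once, so processing a prompt of length T costs O(T), versus the O(T²) pairwise comparisons of self-attention. Below is a minimal sketch of a generic diagonal SSM scan; this is not the actual Longhorn update rule, and the function name, shapes, and parameters are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Generic diagonal state-space model scan (illustrative, not Longhorn).

    x: (T, d_in)        input sequence, e.g. a tokenized demonstration prompt
    a: (d_state,)       per-channel decay; |a| < 1 keeps the recurrence stable
    b: (d_state, d_in)  input projection
    c: (d_out, d_state) output readout
    Returns y: (T, d_out). The state h is updated once per step, so the
    whole scan is O(T) in sequence length.
    """
    h = np.zeros(a.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = a * h + b @ x[t]   # recurrent state update (fixed-size state)
        ys.append(c @ h)       # linear readout from the state
    return np.stack(ys)

# Tiny usage example with hypothetical dimensions.
rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 3
y = ssm_scan(rng.normal(size=(T, d_in)),
             np.full(d_state, 0.9),
             rng.normal(size=(d_state, d_in)),
             rng.normal(size=(d_out, d_state)))
print(y.shape)  # (16, 3)
```

Because the state carries all history in a fixed-size vector, inference cost per step does not grow with prompt length, which is what makes long in-context prompts of many demonstrations tractable.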
Authors

Youngju Yoo, The University of Texas at Austin & KAIST
Jiaheng Hu, UT-Austin (Robot Learning, Reinforcement Learning, Robotics, Mobile Manipulation)
Yifeng Zhu, The University of Texas at Austin
Bo Liu, The University of Texas at Austin & FAIR at Meta
Qiang Liu, The University of Texas at Austin
Roberto Martín-Martín, The University of Texas at Austin (Robotics, Artificial Perception, Machine Learning, Interactive Perception, Probabilistic Reasoning)
Peter Stone, The University of Texas at Austin & Sony AI