🤖 AI Summary
Existing diffusion-based policies struggle with long-horizon tasks due to their reliance on only current or short-term observations, which fails to resolve ambiguities arising from historical dependencies. This work proposes a state space model (SSM)-based diffusion policy that, for the first time, leverages SSMs to encode the full history of observations. It introduces a hierarchical conditioning mechanism that fuses a compressed global history representation with recent states to generate actions. To preserve critical information about future dynamics, the method incorporates a dynamics-aware auxiliary objective during training. Evaluated on both simulated and real-world robotic manipulation tasks, the proposed model achieves state-of-the-art performance with significantly smaller model size and efficiently handles input histories of arbitrary length.
📝 Abstract
Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.