PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently modeling multimodal action distributions and fusing multisensory inputs under real-time robotic control constraints. The authors propose PRISM, a framework that integrates Rejection Sampling-based Implicit Maximum Likelihood Estimation (RS-IMLE) with Performer linear attention to enable single-pass multimodal policy generation. PRISM incorporates a multisensory temporal encoder that processes RGB, depth, tactile, audio, and proprioceptive signals, generating diverse yet precise actions without iterative sampling. Evaluated on both physical robots and the CALVIN simulation benchmark, PRISM achieves state-of-the-art performance, improving task success rates by 10–25%, reducing trajectory jerk by 20–50×, and enabling high-frequency closed-loop control at 30–50 Hz.
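The core mechanism behind RS-IMLE can be illustrated with a toy sketch: sample a pool of latent codes, generate candidate actions in one pass, reject candidates that fall within a threshold of any expert action in the batch (the "batch-global" rule the paper names), then pull each expert action toward its nearest surviving candidate. The code below is a simplified numpy illustration, not the paper's implementation; the L2 objective, the `eps` threshold, and the fallback when all candidates are rejected are assumptions for the sketch.

```python
import numpy as np

def rs_imle_step(actions, generator, latent_dim, n_samples=32, eps=0.1, rng=None):
    """One RS-IMLE matching step (illustrative sketch, not the paper's code).

    actions:   (B, D) batch of expert actions.
    generator: maps latents (n, latent_dim) -> candidate actions (n, D).
    Returns the mean distance from each expert action to its nearest
    accepted candidate (the quantity a trainer would minimize).
    """
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal((n_samples, latent_dim))
    candidates = generator(z)                                  # (n, D), single pass

    # Pairwise L2 distances between expert actions and candidates: (B, n)
    dists = np.linalg.norm(actions[:, None, :] - candidates[None, :, :], axis=-1)

    # Batch-global rejection: drop any candidate within eps of ANY expert
    # action, so surviving samples cannot trivially sit on the data.
    accepted = (dists > eps).all(axis=0)                       # (n,)
    if not accepted.any():
        accepted[:] = True                                     # fallback: keep all

    # Nearest-accepted-candidate matching, as in standard IMLE.
    d_acc = dists[:, accepted]                                 # (B, n_accepted)
    return d_acc.min(axis=1).mean()
```

A real training loop would backpropagate this loss into the generator; here the generator is just a stand-in callable.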

📝 Abstract
Robotic imitation learning typically requires models that capture multimodal action distributions while operating at real-time control rates and accommodating multiple sensing modalities. Although recent generative approaches such as diffusion models, flow matching, and Implicit Maximum Likelihood Estimation (IMLE) have achieved promising results, they often satisfy only a subset of these requirements. To address this, we introduce PRISM, a single-pass policy based on a batch-global rejection-sampling variant of IMLE. PRISM couples a temporal multisensory encoder (integrating RGB, depth, tactile, audio, and proprioception) with a linear-attention generator based on the Performer architecture. We demonstrate the efficacy of PRISM on a diverse real-world hardware suite, including loco-manipulation with a Unitree Go2 quadruped equipped with a 7-DoF D1 arm and tabletop manipulation with a UR5 manipulator. Across challenging physical tasks such as pre-manipulation parking, high-precision insertion, and multi-object pick-and-place, PRISM outperforms state-of-the-art diffusion policies by 10–25% in success rate while maintaining high-frequency (30–50 Hz) closed-loop control. We further validate our approach on large-scale simulation benchmarks, including CALVIN, MetaWorld, and Robomimic. In CALVIN (10% data split), PRISM improves success rates by approximately 25% over diffusion and approximately 20% over flow matching, while simultaneously reducing trajectory jerk by 20–50×. These results position PRISM as a fast, accurate, and multisensory imitation policy that retains multimodal action coverage without the latency of iterative sampling.
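The Performer component the abstract refers to replaces quadratic softmax attention with a random-feature approximation so the generator scales linearly in sequence length. A minimal numpy sketch of FAVOR+-style positive random features is shown below; this is a simplified single-head version with a fixed Gaussian projection, not the full Performer kernel (which also uses orthogonal features and redrawing).

```python
import numpy as np

def performer_attention(Q, K, V, n_features=64, rng=None):
    """Linear-time attention via positive random features (FAVOR+-style sketch).

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v).
    Approximates softmax(Q K^T / sqrt(d)) V without forming the
    (n_q, n_k) attention matrix.
    """
    rng = rng or np.random.default_rng(0)
    d = Q.shape[-1]
    W = rng.standard_normal((d, n_features))   # random Gaussian projection

    def phi(X):
        # Positive random features for the softmax kernel:
        # exp(w.x' - |x'|^2 / 2) with x' = x / d^{1/4}, averaged over features.
        proj = (X @ W) / d ** 0.25
        return np.exp(proj - (X ** 2).sum(-1, keepdims=True) / (2 * np.sqrt(d))) \
               / np.sqrt(n_features)

    Qp, Kp = phi(Q), phi(K)                    # (n_q, m), (n_k, m)
    KV = Kp.T @ V                              # (m, d_v): computed once, O(n_k)
    num = Qp @ KV                              # (n_q, d_v)
    den = Qp @ Kp.sum(axis=0)[:, None]         # (n_q, 1) normalizer
    return num / den
```

Because `Kp.T @ V` is computed once, cost is O(n·m·d) rather than O(n²·d), which is what makes 30–50 Hz single-pass inference plausible for long observation histories.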
Problem

Research questions and friction points this paper is trying to address.

imitation learning
multisensory integration
real-time control
multimodal action distributions
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

single-pass policy
multisensory imitation learning
IMLE with rejection sampling
Performer architecture
real-time robotic control