MARS Policy: Multimodality Only When It Matters

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an inefficiency in multimodal imitation learning for robotics: stochasticity is typically applied throughout the entire policy, which complicates training and slows inference, even though behavioral diversity is needed only in certain task phases. The authors propose MARS (Modality-Adaptive Robot Sampling), a policy that determines whether a given task phase calls for diversity and injects tailored stochastic noise only then, otherwise falling back to an efficient deterministic policy. The stated aim is to keep the multimodal expressiveness of generative policies while approaching the training and inference efficiency of deterministic models. Evaluated on eight simulated and four real-world tasks, MARS reports a 16.67% success rate improvement and an 83.20% reduction in inference latency in real-world tests, and it trains more efficiently than deterministic policies on near-deterministic tasks.

📝 Abstract

Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. In particular, multimodality, which enables robots to capture diverse yet valid behavioral patterns, has driven the rapid emergence of generative policies as a dominant paradigm in robot learning. However, achieving such multimodality typically relies on stochastic noise initialization and iterative denoising procedures, resulting in substantial training complexity and low inference efficiency. Meanwhile, not all phases of a robotic task inherently require behavioral diversity. Motivated by this insight, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial, while reverting to efficient deterministic learning during single-modal phases. In other words, the proper amount of noise is injected only at the proper time. By selectively activating multimodal generation, the MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies across 8 simulated and 4 real-world tasks demonstrate that MARS exhibits robust multimodal expressivity and high efficiency, with a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests. Counterintuitively, MARS also outpaces deterministic policies in training efficiency on near-deterministic tasks by more effectively modeling nuanced action diversity.
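
The idea of injecting noise only in phases that need it can be illustrated with a toy gate. The sketch below is not the MARS architecture, which the abstract does not specify. The gate `needs_diversity`, the two heads, the threshold, and the one-dimensional obstacle scenario are all hypothetical names and assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def needs_diversity(obs: Sequence[float], threshold: float = 0.1) -> bool:
    """Hypothetical gate: treat the phase as multimodal when the obstacle
    is (almost) dead ahead, so passing left and passing right are both valid."""
    lateral_offset = obs[0]
    return abs(lateral_offset) < threshold


def deterministic_head(obs: Sequence[float]) -> List[float]:
    """Single-modal phase: steer away from the side the obstacle is on."""
    return [1.0 if obs[0] < 0 else -1.0]


def stochastic_head(obs: Sequence[float], noise: float) -> List[float]:
    """Multimodal phase: the sign of the sampled noise selects one valid mode."""
    return [1.0 if noise >= 0 else -1.0]


def act(obs: Sequence[float], rng: random.Random) -> Tuple[List[float], bool]:
    """Return (action, sampled), where sampled says whether noise was drawn."""
    if needs_diversity(obs):
        noise = rng.gauss(0.0, 1.0)  # noise is drawn only in this branch
        return stochastic_head(obs, noise), True
    return deterministic_head(obs), False  # no sampling, single forward pass


if __name__ == "__main__":
    rng = random.Random(0)
    print(act([0.5], rng))   # obstacle to the right: deterministic branch
    print(act([0.0], rng))   # obstacle dead ahead: stochastic branch
```

In this toy setup the deterministic branch never touches the random number generator, which mirrors the efficiency argument in the abstract: the cost of sampling is paid only in phases where more than one action is valid.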

Problem

Research questions and friction points this paper is trying to address.

multimodality

imitation learning

robotic manipulation

inference efficiency

behavioral diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodality

imitation learning

stochasticity