MARS Policy: Multimodality Only When It Matters

📅 2026-05-28
🤖 AI Summary
This work addresses an inefficiency in multimodal imitation learning for robotics: generative policies apply stochasticity throughout the entire policy, which makes training complex and inference slow, even though behavioral diversity is needed only in certain task phases. The authors propose MARS (Modality-Adaptive Robot Sampling), a policy that injects tailored stochastic noise only in phases where diversity is beneficial and otherwise falls back to efficient deterministic learning. In this way MARS aims to keep the multimodal expressiveness of generative policies while approaching the training and inference efficiency of deterministic models. Evaluated on eight simulated and four real-world tasks, it reports a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests, and it also trains more efficiently than deterministic policies on near-deterministic tasks.
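To put the latency figure in more familiar terms, the reduction can be converted to a speedup factor. This is a back-of-the-envelope reading that assumes the 83.20% is measured relative to a baseline's per-inference latency; the symbols $t_{\text{base}}$ and $t_{\text{MARS}}$ are introduced here for illustration and do not come from the paper.

```latex
\[
\frac{t_{\text{MARS}}}{t_{\text{base}}} = 1 - 0.8320 = 0.1680
\quad\Longrightarrow\quad
\text{speedup} = \frac{t_{\text{base}}}{t_{\text{MARS}}} = \frac{1}{0.1680} \approx 5.95
\]
```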
📝 Abstract
Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. In particular, multimodality, which enables robots to capture diverse yet valid behavioral patterns, has driven the rapid emergence of generative policies as a dominant paradigm in robot learning. However, achieving such multimodality typically relies on stochastic noise initialization and iterative denoising procedures, resulting in substantial training complexity and low inference efficiency. Meanwhile, not all phases of a robotic task inherently require behavioral diversity. Motivated by this insight, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial, while reverting to efficient deterministic learning during single-modal phases. In other words, the proper amount of noise is injected only at the proper time. By selectively activating multimodal generation, the MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies across 8 simulated and 4 real-world tasks demonstrate that MARS exhibits robust multimodal expressivity and high efficiency, with a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests. Counterintuitively, MARS also outpaces deterministic policies in training efficiency on near-deterministic tasks by more effectively modeling nuanced action diversity.
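The idea of injecting noise only in phases that need diversity can be illustrated with a toy gated policy. The sketch below is not the authors' implementation: the abstract does not describe how MARS decides when a phase is multimodal or how its noise is shaped, so the gate, both heads, the 0.2 distance cutoff, and the 0.5 threshold are all hypothetical stand-ins chosen for a one-dimensional obstacle-avoidance example.

```python
import random

def toy_gate(distance_to_obstacle):
    # Hypothetical gate: assume diversity matters only close to the obstacle,
    # where passing on the left or on the right are both valid.
    return 1.0 if abs(distance_to_obstacle) < 0.2 else 0.0

def deterministic_head(distance_to_obstacle):
    # Single-modal phase: always move straight ahead as (forward, lateral).
    return (1.0, 0.0)

def stochastic_head(distance_to_obstacle, z):
    # Multimodal phase: the sign of the noise sample selects the side.
    side = 1.0 if z >= 0.0 else -1.0
    return (1.0, 0.5 * side)

def gated_step(distance_to_obstacle, rng, threshold=0.5):
    # Returns (action, used_noise). Noise is drawn only when the gate fires.
    if toy_gate(distance_to_obstacle) < threshold:
        return deterministic_head(distance_to_obstacle), False
    z = rng.gauss(0.0, 1.0)
    return stochastic_head(distance_to_obstacle, z), True

rng = random.Random(0)
far_action, far_used_noise = gated_step(1.0, rng)                 # far from obstacle
near_actions = {gated_step(0.0, rng)[0] for _ in range(200)}      # at the obstacle
```

Far from the obstacle the step never samples noise and always returns the same action, while at the obstacle repeated calls cover both the left and right modes. A learned policy would replace the hand-written gate and heads with trained networks.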
Problem

Research questions and friction points this paper is trying to address.

multimodality
imitation learning
robotic manipulation
inference efficiency
behavioral diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodality
imitation learning
stochasticity
robotic manipulation
efficient inference
Jindou Jia
MARS Lab, Nanyang Technological University
Tuo An
MARS Lab, Nanyang Technological University
Yuxuan Hu
MARS Lab, Nanyang Technological University
Gen Li
Postdoctoral Research Fellow, Nanyang Technological University
Embodied AI, Computer Vision, Robotics, Artificial Intelligence
Jingliang Li
MARS Lab, Nanyang Technological University
Bohan Hou
PhD of Computer Science, Carnegie Mellon University
Machine Learning, Systems
Xiangyu Chen
MARS Lab, Nanyang Technological University
Jiaqi Bai
Beihang University
Natural Language Processing, Information Retrieval, Large Language Model
Bofan Lyu
MARS Lab, Nanyang Technological University
Jianfei Yang
Assistant Professor, Director of MARS Lab, Nanyang Technological University
Physical AI, Embodied AI, Multimodal AI