From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

2026-05-12
Citations: 0
Influential: 0

AI Summary
Existing video generation models struggle to turn imagined future visual sequences into stable, executable robot actions, because what makes a video look realistic is not what a controller needs. This work proposes MoLA (Mixture of Latent Actions), which uses complementary perceptual cues (semantics, depth, and optical flow) to infer structured, physically grounded latent action representations from generated videos. A mixture of pretrained inverse dynamics models performs this inference and serves as the interface between visual imagination and policy execution. The authors report consistent gains in task success, temporal consistency, and generalization on the simulated benchmarks LIBERO, CALVIN, and LIBERO-Plus, as well as on real-world robot manipulation tasks.
Abstract
Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.
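The interface described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the model architectures or how the per-modality latent actions are combined. Here each pretrained inverse dynamics model is replaced by a hypothetical random linear map (`make_toy_idm`), the per-modality features are random arrays standing in for semantic, depth, and flow encodings of an imagined video, and the "mixture" is assumed to be a simple stack of one latent action per modality per frame transition. All function names and dimensions are illustrative.

```python
import numpy as np

MODALITIES = ("semantic", "depth", "flow")  # cue names taken from the abstract


def make_toy_idm(feat_dim, latent_dim, seed):
    """Hypothetical stand-in for one pretrained inverse dynamics model (IDM)."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((2 * feat_dim, latent_dim)) / np.sqrt(2 * feat_dim)

    def idm(feat_t, feat_next):
        # An IDM maps a pair of consecutive states to the action between them.
        pair = np.concatenate([feat_t, feat_next], axis=-1)
        return np.tanh(pair @ weights)

    return idm


def mixture_of_latent_actions(features, idms):
    """Infer one latent action per modality for every frame transition.

    features: dict mapping modality name -> array of shape (T, feat_dim),
              i.e. per-frame features of an imagined video (assumed given).
    Returns an array of shape (T - 1, num_modalities, latent_dim).
    """
    per_modality = []
    for name in MODALITIES:
        f = features[name]
        per_modality.append(idms[name](f[:-1], f[1:]))
    # Assumption: keep the latents side by side and let the policy consume them.
    return np.stack(per_modality, axis=1)


if __name__ == "__main__":
    T, feat_dim, latent_dim = 8, 16, 4
    rng = np.random.default_rng(0)
    features = {m: rng.standard_normal((T, feat_dim)) for m in MODALITIES}
    idms = {m: make_toy_idm(feat_dim, latent_dim, seed=i) for i, m in enumerate(MODALITIES)}
    latents = mixture_of_latent_actions(features, idms)
    print(latents.shape)  # (7, 3, 4)
```

In this sketch the policy would be conditioned on `latents` rather than on the predicted frames themselves, which is the design choice the abstract argues for; how the actual method weights or fuses the modalities is left open here.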
Problem

Research questions and friction points this paper is trying to address.

robot manipulation
video generation
action execution
inverse dynamics
latent actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Latent Actions
inverse dynamics models
video imagination
robot manipulation
action-centric representation
Similar Papers