🤖 AI Summary
Existing vision-language-action (VLA) models rely on pretraining with discrete, static image data, limiting their ability to capture physical dynamics and temporal dependencies—thus necessitating large-scale expert demonstration trajectories. To address this, we propose a video-driven end-to-end control paradigm that, for the first time, integrates internet-scale video foundation models into robotic action modeling, yielding the Video-Action Model (VAM). We introduce a flow-matching-based action decoder that serves as an explicit inverse-dynamics model encoding physical causality, decoupling semantic planning from motion control in latent space. Evaluated on both simulation and real-robot benchmarks, VAM achieves state-of-the-art performance, improves sample efficiency by 10×, and accelerates convergence by 2× compared to prior methods.
📝 Abstract
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale but disconnected, static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce the Video-Action Model (VAM), a novel architecture that pairs a pretrained Internet-scale video model with a flow-matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
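To make the architecture concrete, here is a minimal sketch of a flow-matching action decoder acting as an inverse dynamics model conditioned on video-model latents. All class names, dimensions, and the linear (rectified-flow style) interpolant are illustrative assumptions, not the paper's actual implementation; the real VAM conditions on latents produced by a pretrained video foundation model, which is abstracted here as a plain tensor `z`.

```python
import torch
import torch.nn as nn


class FlowMatchingActionDecoder(nn.Module):
    """Hypothetical sketch: learns a velocity field v(a_t, t | z) that
    transports noise into low-level robot actions, conditioned on a
    latent video-space plan z (the IDM role described in the abstract)."""

    def __init__(self, latent_dim=512, action_dim=7, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + latent_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, a_t, t, z):
        # a_t: (B, action_dim) noisy action; t: (B, 1) flow time in [0, 1];
        # z: (B, latent_dim) latent from the (frozen) video model.
        return self.net(torch.cat([a_t, z, t], dim=-1))


def flow_matching_loss(decoder, actions, z):
    """Flow-matching objective with a linear interpolant between noise
    and the expert action; the target velocity is then constant."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = (1 - t) * noise + t * actions   # interpolate noise -> action
    target_v = actions - noise            # velocity of the interpolant
    pred_v = decoder(a_t, t, z)
    return ((pred_v - target_v) ** 2).mean()


@torch.no_grad()
def sample_actions(decoder, z, action_dim=7, steps=10):
    """Euler integration of the learned ODE from noise to an action."""
    a = torch.randn(z.shape[0], action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((z.shape[0], 1), i * dt)
        a = a + dt * decoder(a, t, z)
    return a
```

At inference time, the video model summarizes the scene and instruction into `z`, and a handful of Euler steps through the decoder yields an executable action, which is how the decoupling of semantic planning (video latent) from motion control (action decoding) plays out in practice.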