🤖 AI Summary
In robotic manipulation under data-scarce conditions, decoupling video understanding from closed-loop control leads to high latency and weak physical grounding. This paper addresses both problems with Vidarc, a mask-augmented, inverse-dynamics-enhanced autoregressive video diffusion model. The method couples action-aware video prediction with cache-based real-time feedback, combining an action-correlated masking mechanism, inverse dynamics modeling, and large-scale pretraining on one million cross-embodiment robot interaction episodes to enable end-to-end action generation and dynamic correction. Its key innovation is the first incorporation of inverse dynamics priors into a video diffusion architecture, enabling low-latency closed-loop control. Experiments demonstrate a ≥15% improvement in task success rate in real-world deployment, a 91% reduction in average control latency, and strong cross-platform generalization with online error recovery.
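The summary describes the masked inverse dynamics component only at a high level. The sketch below is a minimal PyTorch illustration of how an action-relevant mask could gate inverse dynamics regression over a pair of predicted frame features, so that the action head attends only to regions the action can influence. All module names, feature dimensions, and the mask parameterization here are assumptions made for illustration, not Vidarc's actual design.

```python
import torch
import torch.nn as nn

class MaskedInverseDynamics(nn.Module):
    """Regresses the action that maps frame t to frame t+1, pooling only
    over regions selected by a learned action-relevance mask (hypothetical)."""

    def __init__(self, feat_dim: int = 256, action_dim: int = 7):
        super().__init__()
        # Predicts a per-pixel relevance mask in [0, 1] from the frame pair.
        self.mask_head = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid(),
        )
        # Maps masked, globally pooled features to an action
        # (e.g., a 7-DoF arm command; the dimension is an assumption).
        self.action_head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, feat_t: torch.Tensor, feat_tp1: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([feat_t, feat_tp1], dim=1)     # (B, 2C, H, W)
        mask = self.mask_head(pair)                     # (B, 1, H, W)
        pooled = (pair * mask).flatten(2).mean(dim=-1)  # masked global pooling
        return self.action_head(pooled)                 # (B, action_dim)
```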
📝 Abstract
Robotic arm manipulation in data-scarce settings is highly challenging due to complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error-correction capabilities across previously unseen robotic platforms.
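As a rough sketch of the closed-loop pattern the abstract describes (cached autoregressive generation plus inverse dynamics), the loop below shows where the latency savings would come from: each step predicts only one new frame against a reused context cache rather than re-encoding the full history, and executing the recovered action feeds a real observation back into the model. The `env`, `video_model`, and `inv_dyn` interfaces are hypothetical placeholders, not an actual Vidarc API.

```python
def closed_loop_control(env, video_model, inv_dyn, horizon: int = 200):
    """Hypothetical rollout: cached autoregressive prediction + inverse dynamics."""
    obs = env.reset()
    # Encode the initial observation once; later steps reuse this cache
    # instead of re-processing the whole frame history.
    cache = video_model.init_cache(obs)
    for _ in range(horizon):
        # Autoregressively generate only the next frame, conditioned on the cache.
        next_frame, cache = video_model.predict_next(obs, cache)
        # Recover the action that would realize the predicted frame.
        action = inv_dyn(obs, next_frame)
        # Executing on the real robot re-grounds prediction with real feedback,
        # which is what enables online error correction.
        obs, done = env.step(action)
        if done:
            break
```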