Vidarc: Embodied Video Diffusion Model for Closed-loop Control

📅 2025-12-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high latency and weak physical grounding that video-based methods show in closed-loop robotic manipulation under data-scarce conditions, this paper proposes Vidarc, an autoregressive video diffusion model augmented with a masked inverse dynamics model. The method grounds video prediction with action-relevant masks, feeds real-time observations back through cached autoregressive generation, and is pre-trained on one million cross-embodiment episodes. Its key contribution is coupling inverse dynamics with a video diffusion architecture for low-latency closed-loop control. Experiments report a success rate at least 15% higher than state-of-the-art baselines in real-world deployment, a 91% reduction in latency, and robust generalization and error correction on previously unseen robotic platforms.

๐Ÿ“ Abstract
Robotic arm manipulation in data-scarce settings is highly challenging because of complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for embodiment-specific closed-loop control and typically suffer from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error correction capabilities across previously unseen robotic platforms.
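The closed-loop idea in the abstract (predict the next frame from a cache of past frames, turn the prediction into an action, then feed the real observation back) can be illustrated with a toy loop. This is only a sketch under strong assumptions, not Vidarc's implementation: frames are short lists of floats, and `FrameCache`, `predict_next_frame`, `inverse_dynamics`, and `environment_step` are hypothetical stand-ins for the video diffusion model, its cache, the inverse dynamics model, and the robot.

```python
from dataclasses import dataclass, field
from typing import List

Frame = List[float]  # toy stand-in for an image or latent frame


@dataclass
class FrameCache:
    """Bounded frame history; a stand-in for a model-side generation cache."""
    max_len: int
    frames: List[Frame] = field(default_factory=list)

    def append(self, frame: Frame) -> None:
        self.frames.append(list(frame))
        if len(self.frames) > self.max_len:
            self.frames.pop(0)  # drop the oldest frame


def predict_next_frame(cache: FrameCache, goal: Frame) -> Frame:
    """Hypothetical video model: move halfway from the latest cached frame to the goal."""
    last = cache.frames[-1]
    return [x + 0.5 * (g - x) for x, g in zip(last, goal)]


def inverse_dynamics(current: Frame, predicted: Frame) -> Frame:
    """Hypothetical inverse dynamics: the action is the change needed to reach the prediction."""
    return [p - c for c, p in zip(current, predicted)]


def environment_step(state: Frame, action: Frame, disturbance: float = 0.0) -> Frame:
    """Toy robot: apply the action, plus an optional external push."""
    return [s + a + disturbance for s, a in zip(state, action)]


def closed_loop_rollout(start: Frame, goal: Frame, steps: int,
                        disturbance_at: int = -1) -> Frame:
    cache = FrameCache(max_len=4)
    state = list(start)
    cache.append(state)
    for t in range(steps):
        predicted = predict_next_frame(cache, goal)
        action = inverse_dynamics(state, predicted)
        push = 0.3 if t == disturbance_at else 0.0
        state = environment_step(state, action, push)
        # Feedback: cache the observed frame, not the imagined one,
        # so the next prediction starts from what actually happened.
        cache.append(state)
    return state
```

Calling `closed_loop_rollout([0.0, 0.0], [1.0, 1.0], steps=12, disturbance_at=3)` pushes the toy robot off course at one step; because each new prediction is conditioned on the observed frame, the remaining steps pull the state back toward the goal.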
Problem

Research questions and friction points this paper is trying to address.

Addresses robotic arm control in data-scarce environments
Reduces latency and improves physical grounding in closed-loop control
Enhances generalization across unseen robotic platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive embodied video diffusion model
Masked inverse dynamics model for grounding
Cached autoregressive generation for real-time feedback
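To make the masked inverse dynamics item more concrete, here is a minimal sketch of how a mask can keep an action estimate focused on action-relevant pixels. It is an illustration under assumptions, not the paper's model: frames are flat lists of pixel values, the mask is set by hand rather than learned, the "action" is a single scalar, and `masked_inverse_dynamics` is a hypothetical name.

```python
from typing import List


def masked_inverse_dynamics(frame_t: List[float],
                            frame_next: List[float],
                            mask: List[float]) -> float:
    """Estimate a scalar action as the mask-weighted mean change between two frames.

    Pixels with mask weight 0 (for example, background) do not affect the estimate.
    """
    if not (len(frame_t) == len(frame_next) == len(mask)):
        raise ValueError("frames and mask must have the same length")
    total = sum(mask)
    if total == 0:
        raise ValueError("mask must select at least one pixel")
    weighted_change = sum(m * (b - a) for a, b, m in zip(frame_t, frame_next, mask))
    return weighted_change / total


# Toy example: pixel 0 is a background distractor that changes a lot,
# pixels 2 and 3 belong to the arm and move by about 0.1.
frame_t = [0.0, 0.0, 0.2, 0.2]
frame_next = [0.9, 0.0, 0.3, 0.3]
arm_mask = [0.0, 0.0, 1.0, 1.0]
no_mask = [1.0, 1.0, 1.0, 1.0]

masked_action = masked_inverse_dynamics(frame_t, frame_next, arm_mask)
unmasked_action = masked_inverse_dynamics(frame_t, frame_next, no_mask)
```

With the arm mask the estimate stays close to the arm's actual motion, while the unmasked estimate is inflated by the background change. That is the intuition behind using action-relevant masks for grounding.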
Yao Feng
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Chendong Xiang
First-year PhD student in computer science and technology, Tsinghua University
Generative models, embodied AI
Xinyi Mao
Undergraduate, Tsinghua University
Robotics, Embodied AI
Hengkai Tan
Tsinghua University
Reinforcement Learning, Robot Learning, Embodied AI, Deep Generative Models
Zuyue Zhang
School of Architecture, Tsinghua University
Shuhe Huang
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Kaiwen Zheng
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Haitian Liu
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Hang Su
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University
Jun Zhu
Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University