Causal World Modeling for Robot Control

📅 2026-01-29
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work proposes LingBot-VA, an autoregressive diffusion-based control framework that integrates video world modeling with causal reasoning to enhance long-horizon robotic control and generalization in complex environments. By leveraging a Mixture-of-Transformers architecture, the method constructs a shared latent space for vision and action, enabling joint learning of video frame prediction and policy execution. It further incorporates closed-loop rolling inference and asynchronous parallel control mechanisms to improve temporal coherence and responsiveness. Experimental results demonstrate that LingBot-VA significantly outperforms baseline approaches in both simulation and real-world settings, achieving higher success rates on long-horizon tasks, improved data efficiency, and stronger generalization to novel environmental configurations.
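The shared vision-action latent space driven by a Mixture-of-Transformers can be sketched roughly as below: vision and action tokens attend jointly over one interleaved sequence, while each modality keeps its own projection weights (the per-modality "experts"). All names, shapes, and weights here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTBlock:
    """One toy Mixture-of-Transformers block: shared attention over the
    interleaved vision/action sequence, modality-specific projections."""
    def __init__(self, d, rng):
        # Separate Q/K/V/output parameters per modality (illustrative sizes).
        self.W = {m: {k: rng.standard_normal((d, d)) / np.sqrt(d)
                      for k in ("q", "k", "v", "o")}
                  for m in ("vision", "action")}

    def __call__(self, tokens, modality):
        # tokens: (T, d) array; modality: length-T list of "vision"/"action"
        q = np.stack([t @ self.W[m]["q"] for t, m in zip(tokens, modality)])
        k = np.stack([t @ self.W[m]["k"] for t, m in zip(tokens, modality)])
        v = np.stack([t @ self.W[m]["v"] for t, m in zip(tokens, modality)])
        # Single joint attention: every token sees both modalities.
        att = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out = att @ v
        out = np.stack([t @ self.W[m]["o"] for t, m in zip(out, modality)])
        return tokens + out  # residual connection

d = 16
rng = np.random.default_rng(0)
block = MoTBlock(d, rng)
seq = rng.standard_normal((6, d))
mods = ["vision"] * 4 + ["action"] * 2  # interleaved vision/action tokens
out = block(seq, mods)
print(out.shape)  # → (6, 16)
```

The key design point this sketch captures is that parameters are routed by modality while attention stays shared, so action tokens can condition directly on predicted visual dynamics.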

📝 Abstract
This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to support the community.
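The asynchronous inference pipeline of design (3) — overlapping action-chunk prediction with motor execution — can be sketched with a thread and a bounded queue. The function names, chunk size, and latencies below are hypothetical stand-ins, not the paper's API:

```python
import threading
import queue
import time

def predict_action_chunk(obs):
    """Hypothetical policy call: returns a chunk of 4 actions (toy values)."""
    time.sleep(0.02)  # simulate model inference latency
    return [obs + i for i in range(4)]

def execute(action):
    """Hypothetical motor command (no-op here)."""
    time.sleep(0.005)  # simulate actuation time

def control_loop(n_chunks=5):
    chunks = queue.Queue(maxsize=1)  # small buffer: prediction runs ahead by one chunk
    executed = []

    def predictor():
        obs = 0
        for _ in range(n_chunks):
            # Runs concurrently with motor execution of the previous chunk.
            chunks.put(predict_action_chunk(obs))
            obs += 1
        chunks.put(None)  # sentinel: no more chunks

    threading.Thread(target=predictor, daemon=True).start()
    while (chunk := chunks.get()) is not None:
        for a in chunk:  # execute the current chunk while the next is predicted
            execute(a)
            executed.append(a)
    return executed

actions = control_loop()
print(len(actions))  # → 20 (5 chunks x 4 actions)
```

Because the predictor fills the queue while the main loop drains it, model latency is hidden behind actuation time instead of stalling the control loop — the property the abstract attributes to the asynchronous pipeline.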
Problem

Research questions and friction points this paper is trying to address.

causal world modeling
robot control
video world models
action-visual dynamics
long-horizon manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

video world modeling
autoregressive diffusion
Mixture-of-Transformers
closed-loop rollout
asynchronous inference