🤖 AI Summary
This work addresses three deployment bottlenecks in generative world-action models for embodied manipulation: redundant pixel-level reconstruction, memory consumption that grows linearly with the task horizon, and high sequential inference latency. It introduces the Causal Latent World Model (CLWM), which uses DINOv3 features as generative targets to disentangle interaction semantics from visual noise. CLWM pairs a constant-memory Dual-State Test-Time Training (TTT) memory with Speculative Asynchronous Inference, which overlaps part of the diffusion denoising with physical execution and cuts blocking latency by about 50%. Trained within the online EmbodiChain framework, which continuously injects physics-grounded trajectories, CLWM achieves state-of-the-art performance in complex dual-arm simulation and reports zero-shot sim-to-real transfer that outperforms baselines fine-tuned on real-world data.
📝 Abstract
Deploying generative World-Action Models for manipulation is severely bottlenecked by redundant pixel-level reconstruction, $\mathcal{O}(T)$ memory scaling, and sequential inference latency. We introduce the Causal Latent World Model (CLWM), which employs DINOv3 features as generative targets to disentangle interaction semantics from visual noise, yielding highly robust domain generalization. To overcome memory scaling, CLWM features a Dual-State Test-Time Training (TTT) Memory that guarantees a strict $\mathcal{O}(1)$ footprint for long-horizon tasks. To overcome deployment latency, we propose Speculative Asynchronous Inference (SAI), which masks part of the diffusion denoising behind physical execution, cutting blocking latency by about $50\%$. To scale robust policies, we present EmbodiChain, an online framework that establishes the Efficiency Law by injecting an infinite flow of physics-grounded trajectories during training. Extensive experiments validate that CLWM achieves state-of-the-art performance in complex dual-arm simulation and unprecedented zero-shot sim-to-real transfer on physical robots, outperforming baselines explicitly fine-tuned on real-world data.
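
To make the $\mathcal{O}(1)$ memory claim concrete, the following is a minimal sketch of a generic test-time-training memory: a fixed-size fast-weight matrix that takes one gradient step per observation instead of appending to a growing history. It is not the paper's Dual-State design, whose details are not given in the abstract. The class `FastWeightMemory`, its methods, the learning rate, and the toy key/value vectors are all hypothetical and chosen only for illustration.

```python
class FastWeightMemory:
    """Hypothetical fixed-size memory: a dim x dim matrix W trained at test time.

    Assumption: a single linear fast-weight state updated by plain SGD.
    This is a generic TTT-style sketch, not the CLWM Dual-State memory.
    """

    def __init__(self, dim, lr=0.1):
        self.dim = dim
        self.lr = lr
        self.W = [[0.0] * dim for _ in range(dim)]

    def read(self, key):
        # Retrieve W @ key.
        return [sum(w * k for w, k in zip(row, key)) for row in self.W]

    def write(self, key, value):
        # One SGD step on 0.5 * ||W @ key - value||^2.
        # The gradient with respect to W is outer(err, key).
        err = [p - v for p, v in zip(self.read(key), value)]
        for i in range(self.dim):
            for j in range(self.dim):
                self.W[i][j] -= self.lr * err[i] * key[j]
        # Return the loss measured before the update.
        return 0.5 * sum(e * e for e in err)

    def num_state_floats(self):
        # State size depends only on dim, never on the number of writes.
        return self.dim * self.dim


mem = FastWeightMemory(dim=4)
key = [1.0, 0.0, 0.0, 0.0]      # toy query feature
value = [0.5, -0.5, 1.0, 0.0]   # toy target feature
losses = [mem.write(key, value) for _ in range(200)]
recalled = mem.read(key)
```

After 200 writes the state still holds 16 floats, whereas a cache that stored every observation would hold a number of entries proportional to the horizon $T$. The history is compressed into the weights, which is the trade-off that makes a constant footprint possible.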