LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE

πŸ“… 2025-09-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current video world models face two key challenges in long-horizon embodied manipulation video generation: diffusion-based approaches suffer from temporal inconsistency and visual drift, while autoregressive models compromise pixel-level fidelity. To address this, we propose LongScape, a hybrid architecture that employs diffusion modeling within chunks and autoregression across chunks. LongScape introduces two core innovations: (1) an action-guided, variable-length semantic chunking mechanism that aligns each video chunk with a complete manipulation primitive; and (2) a Context-aware Mixture-of-Experts (CMoE) framework that dynamically activates specialized experts per chunk. This design jointly ensures temporal coherence and high-fidelity visual reconstruction. Experiments on multi-minute robotic manipulation rollouts demonstrate that LongScape significantly outperforms state-of-the-art methods, with consistent improvements in visual quality, action-logical consistency, and temporal stability.
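To make the hybrid control flow concrete, here is a minimal PyTorch sketch of how intra-chunk diffusion denoising can be nested inside an inter-chunk autoregressive rollout. It is an illustrative assumption, not the released implementation: the `ChunkDenoiser` module, the flat latent shapes, the fixed number of denoising steps, and the running `context` update are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class ChunkDenoiser(nn.Module):
    """Toy denoiser: predicts the noise in a chunk latent, conditioned on an
    autoregressive context summary and an action embedding (all hypothetical)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 3, 256), nn.GELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_chunk, context, action_emb):
        return self.net(torch.cat([noisy_chunk, context, action_emb], dim=-1))

@torch.no_grad()
def generate_rollout(denoiser, action_embs, latent_dim=64, num_steps=20):
    """Outer loop: autoregressive across chunks. Inner loop: diffusion within a chunk."""
    context = torch.zeros(1, latent_dim)      # summary of previously generated chunks
    chunks = []
    for action_emb in action_embs:            # one chunk per manipulation primitive
        x = torch.randn(1, latent_dim)        # start the chunk from pure noise
        for _ in range(num_steps):            # simplified deterministic denoising
            pred_noise = denoiser(x, context, action_emb)
            x = x - pred_noise / num_steps
        chunks.append(x)
        context = 0.5 * context + 0.5 * x     # fold the new chunk into the causal context
    return torch.cat(chunks, dim=0)

if __name__ == "__main__":
    denoiser = ChunkDenoiser()
    actions = [torch.randn(1, 64) for _ in range(3)]   # three action primitives
    video_latents = generate_rollout(denoiser, actions)
    print(video_latents.shape)                          # torch.Size([3, 64])
```

The point of the sketch is the nesting: each chunk is refined by its own denoising loop, while the causal `context` carries information forward so the next chunk is generated conditionally on everything already produced.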

πŸ“ Abstract
Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: https://github.com/tsinghua-fib-lab/Longscape.
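The action-guided, variable-length chunking can be pictured with a small sketch: chunk boundaries fall where the action primitive changes, so each chunk covers one complete action of whatever length it needs. The `chunk_by_action` helper and the frame-level labels below are hypothetical; the paper's mechanism partitions on learned action context rather than ground-truth labels.

```python
from itertools import groupby

def chunk_by_action(frame_actions):
    """Split a frame-aligned action sequence into variable-length chunks,
    one chunk per contiguous run of the same action primitive.
    Returns (action_label, start_frame, end_frame_exclusive) tuples."""
    chunks, start = [], 0
    for action, run in groupby(frame_actions):
        length = len(list(run))
        chunks.append((action, start, start + length))
        start += length
    return chunks

# Example: a pick-and-place rollout with per-frame action labels (hypothetical).
frames = ["reach"] * 12 + ["grasp"] * 5 + ["lift"] * 8 + ["place"] * 10
print(chunk_by_action(frames))
# [('reach', 0, 12), ('grasp', 12, 17), ('lift', 17, 25), ('place', 25, 35)]
```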
Problem

Research questions and friction points this paper is trying to address.

Achieving stable long-horizon video generation for embodied manipulation
Addressing temporal inconsistency and visual drift in video generation
Improving visual detail while maintaining coherent action sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid diffusion-autoregressive framework for video generation
Action-guided variable-length chunking mechanism
Context-aware Mixture-of-Experts (CMoE) that adaptively activates specialized experts per chunk (see the sketch after this list)
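Below is a minimal sketch of context-aware expert routing, assuming a simple top-k router driven by a per-chunk context vector. The expert count, top-k value, and linear experts are illustrative assumptions, not the CMoE design from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareMoE(nn.Module):
    """Toy context-aware MoE layer: a router scores experts from the chunk's
    context feature, and only the top-k experts process the chunk tokens."""
    def __init__(self, dim=64, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, tokens, context):
        # tokens: (num_tokens, dim) for the current chunk; context: (dim,)
        scores = F.softmax(self.router(context), dim=-1)        # (num_experts,)
        weights, idx = torch.topk(scores, self.top_k)
        weights = weights / weights.sum()                        # renormalise over active experts
        out = torch.zeros_like(tokens)
        for w, i in zip(weights, idx):
            out = out + w * self.experts[int(i)](tokens)         # weighted sum of active experts
        return out

if __name__ == "__main__":
    layer = ContextAwareMoE()
    chunk_tokens = torch.randn(16, 64)
    chunk_context = torch.randn(64)
    print(layer(chunk_tokens, chunk_context).shape)  # torch.Size([16, 64])
```

Because the router is fed the chunk-level context rather than individual tokens, all tokens in a chunk share the same expert assignment, which is one plausible way to keep transitions within a chunk visually consistent.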
πŸ”Ž Similar Papers
No similar papers found.
Yu Shang
Department of Electronic Engineering, Tsinghua University
Multimodal Learning · LLM Agent · Recommender System
Lei Jin
Tsinghua University
Yiding Ma
Tsinghua University
Xin Zhang
Manifold AI
Chen Gao
Tsinghua University
Wei Wu
Manifold AI
Yong Li
Tsinghua University