SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive world models suffer from spatial structural distortion, low decoding efficiency, and weak motion modeling in video prediction. To address these issues, we propose a generative world model that establishes a hybrid spatiotemporal modeling paradigm: it integrates intra-frame bidirectional spatial attention with causal temporal decoding, introduces a trajectory-aware motion prompting module, and employs an asymmetric multi-scale tokenizer, all while enabling parallel autoregressive decoding. The framework significantly improves spatiotemporal consistency and physical plausibility, achieves state-of-the-art performance on action-conditioned video prediction and model-based control tasks, and accelerates inference by 4.4× over baseline methods. The model also demonstrates zero-shot transfer across domains and scales robustly to varying input resolutions and sequence lengths.

📝 Abstract
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
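The hybrid decoding scheme described in the abstract, bidirectional attention within a frame and causal attention across frames, can be pictured as a block-structured attention mask. Below is a minimal sketch of such a mask in NumPy; the function name and the flat token layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hybrid_attention_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) that is bidirectional
    within each frame but causal across frames: a token sees every token
    of its own frame and of earlier frames, and nothing from later frames.
    Sketch only; the real model applies this per scale inside each frame."""
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame  # frame index of each flat token
    # token i attends to token j iff j's frame is not later than i's frame
    return frame_idx[:, None] >= frame_idx[None, :]

# e.g. 3 frames of 4 tokens each -> a 12x12 block lower-triangular mask
mask = hybrid_attention_mask(num_frames=3, tokens_per_frame=4)
```

Because the mask is constant across all tokens of a frame, every token of that frame can be decoded in parallel while the frame sequence itself stays causal, which is the property the abstract credits for the rollout speedup.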
Problem

Research questions and friction points this paper is trying to address.

Improving visual coherence in autoregressive world model predictions
Enhancing motion modeling and dynamic scene understanding efficiency
Addressing spatial structure disruption and inefficient decoding issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid framework combining scale-wise intra-frame autoregression with causal temporal modeling
Asymmetric multi-scale tokenizer optimizing memory and performance
Trajectory-aware motion prompt module enhancing temporal consistency
Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sanping Zhou
Xi'an Jiaotong University
Computer Vision · Machine Learning
Wei Tang
University of Illinois at Chicago
Gang Hua
Amazon.com, Inc.