Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation

📅 2026-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing methods struggle to maintain spatial-temporal consistency when generating 4D objects. To address this challenge, this work proposes 4DSTAR, a Spatial-Temporal State Propagation AutoRegressive model coupled with a 4D Vector Quantized Variational Autoencoder (4D VQ-VAE), which leverages discrete token-based autoregressive modeling to explicitly capture long-range dependencies across historical timesteps. The approach introduces a dynamic spatial-temporal memory mechanism and a grouped autoregressive strategy that propagate spatial-temporal states throughout the generation process, thereby ensuring global consistency. Experimental results demonstrate that the proposed method generates high-quality, temporally coherent 4D objects, achieving performance on par with state-of-the-art diffusion models.
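To make the grouped autoregressive idea concrete, here is a minimal sketch, not the authors' code: tokens are predicted one timestep-group at a time, and a fixed-size spatial-temporal "container" summarizes all previously generated groups to condition the next group. The module names, the attention-based update rule, and the greedy decoding are illustrative assumptions.

```python
# Illustrative sketch of grouped autoregressive decoding with a spatial-temporal
# state container. Names, sizes, and the attention-based update are assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn


class SpatioTemporalContainer(nn.Module):
    """Fixed-size memory that is refreshed after each generated token group."""

    def __init__(self, dim: int = 512, slots: int = 64):
        super().__init__()
        self.init_state = nn.Parameter(torch.randn(1, slots, dim))
        self.update = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.state = None

    def reset(self, batch_size: int):
        self.state = self.init_state.expand(batch_size, -1, -1)

    def write(self, group_feats: torch.Tensor):
        # Fold the newly generated group (one timestep of tokens) into the memory
        # by letting the state slots attend to the group's features.
        delta, _ = self.update(self.state, group_feats, group_feats)
        self.state = self.state + delta


@torch.no_grad()
def generate(token_predictor, embed, container, num_frames: int, batch_size: int = 1):
    """Predict one token group per timestep, conditioned on the container state.

    `token_predictor(state) -> (B, tokens_per_frame, codebook_size)` logits and
    `embed(token_ids) -> (B, tokens_per_frame, dim)` are assumed interfaces.
    """
    container.reset(batch_size)
    groups = []
    for _ in range(num_frames):
        logits = token_predictor(container.state)    # condition on all past groups
        group_ids = logits.argmax(dim=-1)            # greedy decoding for brevity
        groups.append(group_ids)
        container.write(embed(group_ids))            # propagate state to the next step
    return torch.stack(groups, dim=1)                # (B, num_frames, tokens_per_frame)
```

Keeping the container at a fixed number of slots bounds memory while still letting every historical group influence later frames, which is the intuition behind the state-propagation mechanism described above.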

📝 Abstract
Generating high-quality 4D objects with spatial-temporal consistency remains a formidable challenge. Existing diffusion-based methods often suffer from spatial-temporal inconsistency because they fail to leverage the outputs of all previous timesteps to guide generation at the current timestep. We therefore propose the Spatial-Temporal State Propagation AutoRegressive Model (4DSTAR), which generates 4D objects while maintaining spatial-temporal consistency. 4DSTAR formulates generation as the prediction of tokens that represent the 4D object and consists of two key components: (1) A dynamic spatial-temporal state propagation autoregressive model (STAR) that achieves spatial-temporal consistent generation. Unlike standard autoregressive models, STAR divides the prediction tokens into groups by timestep; it models long-term dependencies by propagating spatial-temporal states from previous groups and uses these dependencies to guide generation at the next timestep. To this end, a spatial-temporal container is introduced that dynamically updates the effective spatial-temporal state features from all historical groups; the updated features then serve as conditional features to guide the prediction of the next token group. (2) A 4D VQ-VAE that implicitly encodes the 4D structure into a discrete space and decodes the discrete tokens predicted by STAR into temporally coherent dynamic 3D Gaussians. Experiments demonstrate that 4DSTAR generates spatial-temporal consistent 4D objects and achieves performance competitive with diffusion models.
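As a rough illustration of the second component, the sketch below shows, under several assumptions, how a VQ-VAE-style codebook could map continuous 4D latents to discrete token ids and decode those ids into per-frame 3D Gaussian parameters. The codebook size, latent dimensions, number of Gaussians per token, and the 14-dimensional Gaussian parameterization are hypothetical, not taken from the paper.

```python
# Hedged sketch (not the paper's implementation) of a 4D VQ-VAE-style codebook:
# continuous latents over T frames are quantized to discrete tokens, and tokens
# are decoded back into per-frame 3D Gaussian parameters.
import torch
import torch.nn as nn


class Simple4DVQVAE(nn.Module):
    def __init__(self, latent_dim=256, codebook_size=8192, gaussians_per_token=32):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # 14 = 3 (position) + 3 (scale) + 4 (rotation quat) + 1 (opacity) + 3 (color)
        self.decoder = nn.Linear(latent_dim, gaussians_per_token * 14)
        self.gaussians_per_token = gaussians_per_token

    def quantize(self, z: torch.Tensor):
        # z: (B, T, N, latent_dim) continuous latents over T frames and N tokens.
        dists = torch.cdist(z.flatten(0, 2), self.codebook.weight)   # (B*T*N, K)
        ids = dists.argmin(dim=-1).view(z.shape[:-1])                # (B, T, N)
        z_q = self.codebook(ids)                                     # (B, T, N, latent_dim)
        # Straight-through estimator so gradients reach the encoder during training.
        return z + (z_q - z).detach(), ids

    def decode(self, ids: torch.Tensor):
        # ids: (B, T, N) discrete tokens, e.g. predicted by the autoregressive model.
        z_q = self.codebook(ids)
        params = self.decoder(z_q)                                    # (B, T, N, G*14)
        B, T, N, _ = params.shape
        return params.view(B, T, N * self.gaussians_per_token, 14)   # per-frame Gaussians
```

In 4DSTAR the decoder is also responsible for keeping the Gaussians coherent across frames; this toy version only makes the token-to-Gaussian mapping explicit.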
Problem

Research questions and friction points this paper is trying to address.

4D object generation
spatial-temporal consistency
autoregressive model
temporal coherence
dynamic 3D representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial-temporal consistency
autoregressive model
4D generation
state propagation
4D VQ-VAE
Authors

Liying Yang
Macau University of Science and Technology

Jialun Liu
Baidu | JLU
long-tailed data learning, metric learning, 3D generation

Jiakui Hu
Peking University

Chenhao Guan
Macau University of Science and Technology

Haibin Huang
Principal Research Scientist at TeleAI
Computer Graphics, Computer Vision, Geometric Modeling, 3D Deep Learning

Fangqiu Yi
TeleAI

Chi Zhang
Research Scientist, ByteDance Seed
Machine Learning, Computer Vision

Yanyan Liang
Macau University of Science and Technology