A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

📅 2026-04-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of existing generative world models, which makes capturing diverse, multimodal video futures expensive. The authors propose DeltaTok, a tokenizer that compresses the difference between vision foundation model (VFM) features of consecutive frames into a single continuous “delta” token, and DeltaWorld, a generative world model that operates on these tokens. Delta tokens reduce video from a 3D spatio-temporal representation to a 1D temporal sequence. This compact representation makes multi-hypothesis training tractable: many futures are generated in parallel and only the best is supervised, so the model produces diverse predictions in a single forward pass at inference. On dense forecasting tasks, DeltaWorld's forecasts align more closely with real-world outcomes while using over 35× fewer parameters and 2,000× fewer FLOPs than existing generative world models.

📝 Abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
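The token arithmetic and the "only the best is supervised" idea can be illustrated with a minimal sketch. It is not the authors' implementation. The patch size of 16 is an assumption, chosen because it is consistent with the 1,024x figure quoted for 512x512 frames. The mean-squared-error criterion and the names `tokens_per_frame` and `best_of_k_loss` are hypothetical, standing in for whatever loss and code the paper actually uses.

```python
def tokens_per_frame(height, width, patch):
    """Patch tokens for one frame on a regular grid (patch size is an assumption)."""
    return (height // patch) * (width // patch)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_of_k_loss(hypotheses, target):
    """Winner-takes-all loss over K candidate next-step tokens.

    Only the hypothesis closest to the target contributes to the loss;
    in a real training loop, gradients would flow through that one only.
    Returns (loss, index_of_best_hypothesis).
    """
    errors = [mse(h, target) for h in hypotheses]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best], best


# One delta token per frame versus a 32x32 patch grid (assumed 16-pixel patches).
reduction = tokens_per_frame(512, 512, 16) // 1

# Toy example: three hypothesized delta tokens of dimension 3.
target = [1.0, 0.0, -1.0]
hypotheses = [
    [0.0, 0.0, 0.0],
    [1.0, 0.5, -1.0],
    [2.0, 2.0, 2.0],
]
loss, best = best_of_k_loss(hypotheses, target)
print(reduction, best, loss)
```

Because each hypothesis is a single token rather than a full grid of patch tokens, evaluating many hypotheses per step stays cheap, which is the reason the abstract gives for multi-hypothesis training being tractable.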

Problem

Research questions and friction points this paper is trying to address.

generative world modeling

video prediction

future state anticipation

computational efficiency

diverse forecasting

Innovation

Methods, ideas, or system contributions that make the work stand out.

DeltaTok

generative world modeling

vision foundation model