TokensGen: Harnessing Condensed Tokens for Long Video Generation

📅 2025-07-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address memory bottlenecks and long-term temporal inconsistency in diffusion-based long video generation, this paper proposes TokensGen, a two-stage generative framework. First, To2V (Token-to-Video), a short-video diffusion model guided by text and video tokens, is trained together with a Video Tokenizer that condenses short clips into compact, semantically rich tokens. Second, T2To (Text-to-Token), a video-token diffusion transformer, generates the condensed tokens for all clips at once from the text prompt, enforcing global semantic consistency across clips. At inference, an adaptive FIFO-Diffusion strategy smooths transitions between adjacent clips under constrained GPU memory. Built atop pre-trained short-video diffusion models, TokensGen significantly enhances long-term temporal and content coherence without prohibitive computational overhead, establishing an efficient, scalable paradigm for long-video synthesis.

📝 Abstract
Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at https://vicky0522.github.io/tokensgen-webpage/.
Problem

Research questions and friction points this paper is trying to address.

Addresses memory bottlenecks in long video generation
Ensures long-term consistency across video clips
Enhances smooth transitions between adjacent clips
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework with condensed tokens
Token-to-Video and Text-to-Token models
Adaptive FIFO-Diffusion for smooth transitions
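The pipeline summarized above (global token planning with T2To, per-clip decoding with To2V, and FIFO-style boundary smoothing) can be sketched in miniature. This is a toy illustration only: the function names, the scalar "frames," and the cross-fade blend are hypothetical stand-ins, not the authors' models or API.

```python
def t2to(prompt: str, num_clips: int, tokens_per_clip: int) -> list[list[float]]:
    """Stand-in for the text-to-token diffusion transformer (T2To):
    emits the condensed tokens for ALL clips in one pass, so every clip
    is planned against the same global context. Deterministic toy values."""
    base = sum(map(ord, prompt))
    return [[(base + 7 * c + 13 * t) % 100 / 100.0
             for t in range(tokens_per_clip)]
            for c in range(num_clips)]

def to2v(tokens: list[float], frames_per_clip: int) -> list[float]:
    """Stand-in for the token-to-video model (To2V): decodes one short
    clip conditioned on its condensed tokens (here, scalar 'frames')."""
    mean = sum(tokens) / len(tokens)
    return [mean] * frames_per_clip

def fifo_blend(prev: list[float], nxt: list[float], overlap: int) -> list[float]:
    """Crude analogue of adaptive FIFO-Diffusion: cross-fade the trailing
    frames of the previous clip into the leading frames of the next."""
    blended = list(nxt)
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)
        blended[i] = (1 - w) * prev[-overlap + i] + w * nxt[i]
    return blended

def generate_long_video(prompt: str, num_clips: int = 3,
                        frames_per_clip: int = 8, overlap: int = 2) -> list[float]:
    token_plan = t2to(prompt, num_clips, tokens_per_clip=4)  # stage 1: global plan
    clips = [to2v(tok, frames_per_clip) for tok in token_plan]  # stage 2: decode
    video = clips[0]
    for nxt in clips[1:]:
        video += fifo_blend(video, nxt, overlap)  # smooth each clip boundary
    return video

video = generate_long_video("a sailboat at sunset")
print(len(video))  # 3 clips x 8 frames = 24
```

The key point the sketch preserves is ordering: all clips' tokens are produced before any clip is decoded, which is what lets the method enforce long-range consistency without holding every frame in memory at once.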
👥 Authors
Wenqi Ouyang, MMLab@NTU: Computer Vision, Image Synthesis, Low-level Vision
Zeqi Xiao, S-Lab, Nanyang Technological University
Danni Yang, Xiamen University: Multimodal Learning, Video Editing
Yifan Zhou, S-Lab, Nanyang Technological University
Shuai Yang, Wangxuan Institute of Computer Technology, Peking University
Lei Yang, SenseTime Research
Jianlou Si, alibaba-inc.com: MLLM, GenAI, AGI, Embodied AI
Xingang Pan, Assistant Professor, MMLab@NTU, Nanyang Technological University: Computer Vision, Deep Learning, Computer Graphics