The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-video (T2V) generation faces dual bottlenecks: weak semantic understanding in diffusion-based models, and low visual fidelity with error accumulation in autoregressive language-model-based approaches. To address these, we propose LanDiff, a novel two-stage collaborative framework. First, a 3D semantic tokenizer compresses visual features by ~14,000×, enabling efficient semantic representation. Second, a 5B-parameter large language model generates structured semantic sequences, which guide a streaming spatiotemporal diffusion model for fine-grained video reconstruction. This establishes a "coarse semantic generation → fine-grained video reconstruction" paradigm. Evaluated on VBench, LanDiff scores 85.43, surpassing commercial models including Hunyuan Video (13B) and Sora, and significantly improves long-video generation quality, setting a new state of the art among open-source T2V models.

📝 Abstract
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a ~14,000× compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model Hunyuan Video (13B) and other commercial models such as Sora, Keling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
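The scale of a ~14,000× compression ratio can be made concrete with rough arithmetic. The video dimensions below are illustrative assumptions for the example, not figures from the paper:

```python
# Illustrative arithmetic for a ~14,000x semantic compression ratio.
# Frame count and resolution here are assumptions, not from the paper.
frames, height, width, channels = 49, 480, 720, 3

raw_values = frames * height * width * channels  # raw 3D visual feature count
tokens = raw_values // 14_000                    # 1D discrete tokens retained

print(raw_values, tokens)  # → 50803200 3628
```

Even a short clip yields tens of millions of raw values, but only a few thousand semantic tokens, which is what makes autoregressive language modeling over the sequence tractable.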
Problem

Research questions and friction points this paper is trying to address.

Integrates language and diffusion models for video generation
Addresses visual quality and semantic understanding limitations
Achieves high compression and high-fidelity video output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic tokenizer compresses 3D visual features
Language model generates high-level semantic tokens
Streaming diffusion model refines coarse semantics
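The three innovations above form a two-stage pipeline, sketched below with stub classes. All class names, method signatures, and numeric defaults are hypothetical stand-ins for illustration, not the paper's actual API:

```python
# Minimal sketch of LanDiff's coarse-to-fine pipeline.
# Every name and constant here is a hypothetical stand-in.

class SemanticTokenizer:
    """Stub: compresses a 3D visual feature volume into a 1D token budget."""
    def __init__(self, compression_ratio=14_000):
        self.ratio = compression_ratio

    def encode(self, num_features):
        # keep roughly one discrete token per `ratio` raw feature values
        return max(1, num_features // self.ratio)

class SemanticLM:
    """Stub: autoregressively generates a semantic token sequence."""
    def generate(self, prompt, num_tokens):
        # placeholder tokens; a real LM would sample from a codebook
        return [hash((prompt, i)) % 1024 for i in range(num_tokens)]

class StreamingDiffusion:
    """Stub: refines semantic tokens chunk-by-chunk into video frames."""
    def refine(self, tokens, frames_per_token=4):
        return ["frame"] * (len(tokens) * frames_per_token)

def landiff_pipeline(prompt, num_features=70_000_000):
    budget = SemanticTokenizer().encode(num_features)      # token budget
    tokens = SemanticLM().generate(prompt, budget)         # stage 1: coarse semantics
    return StreamingDiffusion().refine(tokens)             # stage 2: fine-grained video

frames = landiff_pipeline("a cat surfing at sunset")
```

The design point the sketch captures is the division of labor: the language model only plans a short, heavily compressed semantic sequence, while the streaming diffusion model bears the cost of rendering high-fidelity frames from it.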