Uniform Discrete Diffusion with Metric Path for Video Generation

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete generative models for video synthesis suffer from error accumulation and long-range spatiotemporal inconsistency. To address these limitations, we propose URSA—a novel framework that introduces iterative global optimization over discrete spatiotemporal tokens for the first time in video generation. Our key contributions are: (1) a linearized metric path to alleviate gradient mismatch in discrete token space; (2) a resolution-dependent timestep shifting mechanism, enabling stable modeling at high spatial resolutions and long temporal durations; and (3) an asynchronous temporal fine-tuning strategy that unifies support for diverse tasks—including image-to-video generation and video interpolation. URSA substantially reduces inference steps while achieving state-of-the-art performance among discrete video generators across multiple benchmarks. Moreover, it matches the quality of leading continuous diffusion models, demonstrating competitive fidelity and coherence without relying on continuous latent spaces.
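The summary's core idea — iterative global refinement of discrete spatiotemporal tokens under uniform discrete diffusion — can be illustrated with a toy sampler. This is a minimal sketch, not URSA's actual algorithm: `denoise_fn` stands in for the learned model, and the corruption process (each token independently replaced by a uniformly random one with probability `t`) is the generic uniform-noise formulation; all names and schedules here are illustrative assumptions.

```python
import random

def uniform_noise(tokens, t, vocab_size, rng):
    """Corrupt each token independently: with probability t, replace it
    with a uniformly random vocabulary token (uniform discrete noise)."""
    return [rng.randrange(vocab_size) if rng.random() < t else tok
            for tok in tokens]

def sample(denoise_fn, length, vocab_size, steps, rng):
    """Iterative global refinement: start from pure uniform noise and, at
    each step, re-predict ALL tokens in parallel, then re-noise them at a
    lower level. Contrast with autoregressive decoding, where early errors
    are frozen in and accumulate."""
    x = [rng.randrange(vocab_size) for _ in range(length)]
    for i in range(steps, 0, -1):
        t_next = (i - 1) / steps           # next (lower) noise level
        x = denoise_fn(x, i / steps)       # global prediction of clean tokens
        x = uniform_noise(x, t_next, vocab_size, rng)  # partial re-noising
    return x

# Toy "model": always predicts token 0 as the clean value.
rng = random.Random(0)
out = sample(lambda toks, t: [0] * len(toks),
             length=8, vocab_size=16, steps=4, rng=rng)
# The final step uses t_next = 0 (no re-noising), so out is all zeros.
```

Because every step revisits every token, mistakes made at high noise levels can be corrected later — the property the summary credits for reducing error accumulation.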

📝 Abstract
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA.
Problem

Research questions and friction points this paper is trying to address.

Bridging performance gap between discrete and continuous video generation
Addressing error accumulation in discrete generative modeling
Enabling scalable high-resolution video generation with fewer steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uniform discrete diffusion with metric path
Linearized metric path for efficient scaling
Resolution-dependent timestep shifting mechanism
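To make the timestep-shifting idea concrete, here is a hedged sketch of one common form of such a mechanism. The shift map t' = s·t / (1 + (s−1)·t) and the square-root scaling of s with token count are assumptions borrowed from the broader flow/diffusion literature, not the specific rule used in URSA; the function name and default parameters are hypothetical.

```python
import math

def shifted_timestep(t, num_tokens, base_tokens=256, base_shift=1.0):
    """Illustrative resolution-dependent timestep shift.

    Larger token grids (higher resolution, longer duration) get a larger
    shift factor s, which biases sampling timesteps toward the high-noise
    end where more refinement is needed. Maps [0, 1] -> [0, 1]:
        t' = s * t / (1 + (s - 1) * t)
    The sqrt scaling of s with token count is an assumed schedule.
    """
    s = base_shift * math.sqrt(num_tokens / base_tokens)
    return s * t / (1 + (s - 1) * t)

# At the base resolution (s = 1) the map is the identity; at 4x the
# token count (s = 2) intermediate timesteps are pushed upward.
```

The design intuition: a fixed noise schedule that works at low resolution under-corrupts large token grids, so the schedule is warped as a function of resolution to keep the effective signal-to-noise behavior stable.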