POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video diffusion models suffer from low sampling efficiency and difficulty in simultaneously ensuring temporal coherence and visual fidelity—especially for long sequences and large-scale models. To address this, we propose POSE, the first framework enabling stable one-step generation for large-scale video diffusion models. POSE introduces a three-stage mechanism—stability warm-up, unified adversarial equilibrium, and conditional adversarial consistency—to model one-step generation trajectories directly in Gaussian noise space. It integrates two-stage knowledge distillation, self-adversarial training, Nash equilibrium optimization, and conditional consistency constraints. On VBench-I2V, POSE improves semantic alignment and temporal quality by 7.15% on average, reduces inference latency from 1000 to 10 seconds (a 100× speedup), and achieves generation quality comparable to the original multi-step models.

📝 Abstract
The field of video diffusion generation faces critical bottlenecks in sampling efficiency, especially for large-scale models and long sequences. Existing video acceleration methods adopt image-based techniques but suffer from fundamental limitations: they neither model the temporal coherence of video frames nor provide single-step distillation for large-scale video models. To bridge this gap, we propose POSE (Phased One-Step Equilibrium), a distillation framework that reduces the sampling steps of large-scale video diffusion models, enabling the generation of high-quality videos in a single step. POSE employs a carefully designed two-phase process to distill video models: (i) stability priming: a warm-up mechanism to stabilize adversarial distillation that adapts the high-quality trajectory of the one-step generator from high to low signal-to-noise ratio regimes, optimizing the video quality of single-step mappings near the endpoints of flow trajectories. (ii) unified adversarial equilibrium: a flexible self-adversarial distillation mechanism that promotes stable single-step adversarial training towards a Nash equilibrium within the Gaussian noise space, generating realistic single-step videos close to real videos. For conditional video generation, we propose (iii) conditional adversarial consistency, a method to improve both semantic consistency and frame consistency between conditional frames and generated frames. Comprehensive experiments demonstrate that POSE outperforms other acceleration methods on VBench-I2V by an average of 7.15% in semantic alignment, temporal coherence and frame quality, reducing the latency of the pre-trained model by 100×, from 1000 seconds to 10 seconds, while maintaining competitive performance.
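The phased recipe in the abstract can be illustrated with a deliberately tiny sketch. Everything below is a hedged toy, not the paper's implementation: a scalar iterative "teacher" sampler, an affine one-step generator, and a least-squares adversarial objective stand in for the actual video diffusion model, discriminator network, and losses; all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Toy 1-D sketch of POSE-style phased distillation (illustrative only).
rng = np.random.default_rng(0)

def teacher_sample(noise, steps=50):
    """Stand-in for an expensive multi-step diffusion sampler that carries
    Gaussian noise to the data distribution (here, mass near 3.0)."""
    x = np.asarray(noise, dtype=float).copy()
    for _ in range(steps):
        x += 0.1 * (3.0 - x)  # crude iterative denoising drift
    return x

# One-step generator: a single affine map from noise to sample.
a, b = 1.0, 0.0
lr = 0.05

# Phase (i) "stability priming": warm up the one-step generator by
# regressing its output onto the teacher's trajectory endpoint.
for _ in range(200):
    z = rng.standard_normal(64)
    err = (a * z + b) - teacher_sample(z)
    a -= lr * np.mean(err * z)  # gradient of 0.5 * mean(err**2) w.r.t. a
    b -= lr * np.mean(err)

# Phase (ii) "unified adversarial equilibrium": alternate discriminator and
# generator steps toward an equilibrium, with a consistency term (loosely
# echoing phase (iii)) anchoring the generator to the teacher.
w, c = 0.0, 0.0  # linear discriminator d(x) = w * x + c
lam = 0.5        # weight of the consistency (distillation) term
for _ in range(200):
    z = rng.standard_normal(64)
    real, fake = teacher_sample(z), a * z + b
    d_real, d_fake = w * real + c, w * fake + c
    # Discriminator: least-squares targets, 1 for real and 0 for fake.
    w += lr * np.mean((1 - d_real) * real - d_fake * fake)
    c += lr * np.mean((1 - d_real) - d_fake)
    # Generator: move so its samples are scored as real, plus consistency.
    g_err = d_fake - 1.0
    cons = fake - real
    a -= lr * np.mean((g_err * w + lam * cons) * z)
    b -= lr * np.mean(g_err * w + lam * cons)
```

The warm-up phase matters: starting the adversarial phase from a generator that already tracks the teacher keeps the two-player dynamics near equilibrium, which mirrors the stabilization motivation the abstract gives for stability priming.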
Problem

Research questions and friction points this paper is trying to address.

Improving sampling efficiency for large-scale video diffusion models
Enabling single-step high-quality video generation through distillation
Enhancing temporal coherence and frame consistency in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phased distillation framework for single-step video generation
Two-phase process with stability priming and adversarial equilibrium
Conditional adversarial consistency for semantic and frame alignment
Jiaxiang Cheng
Tencent Hunyuan
Bing Ma
Marshall Institute for Interdisciplinary Research, Marshall University, Huntington, WV 25755, USA
plastic surgery, nanofibers, immunology, tissue engineering
Xuhua Ren
Tencent Hunyuan
Hongyi Jin
Carnegie Mellon University
machine learning systems, compilers
Kai Yu
Tencent Hunyuan
Peng Zhang
Tencent Hunyuan
Wenyue Li
Tencent Hunyuan
Yuan Zhou
Tencent Hunyuan
Tianxiang Zheng
Tencent Hunyuan
Qinglin Lu
Tencent Hunyuan