AI Summary
Existing streaming video generation methods suffer from initial-frame duplication and motion degradation due to their reliance on a fixed initial frame and sliding-window attention. To address this, we propose the EMA-Sink mechanism, which employs an exponential moving average to continuously fuse features exiting the attention window, thereby enhancing long-term context modeling while preserving short-term dynamics. Additionally, we introduce Rewarded Distribution Matching Distillation (Re-DMD), a reward-based knowledge distillation framework that leverages vision-language models to quantify inter-frame dynamic intensity and assign sample-specific weights during training. Together, these innovations effectively mitigate motion attenuation. Our approach achieves state-of-the-art performance on standard benchmarks, enabling high-fidelity streaming video generation at 23.1 FPS on a single H100 GPU while substantially suppressing frame duplication and maintaining both long-term temporal coherence and fine-grained visual fidelity.
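The EMA-Sink update can be sketched in a few lines. The snippet below is a minimal NumPy illustration of the idea, not the released implementation: the window size `W`, token dimension `D`, and retention rate `alpha` are illustrative assumptions, and real attention over the sink and window tokens is elided.

```python
import numpy as np

def ema_sink_update(sink, evicted, alpha=0.9):
    """Fuse a token evicted from the sliding window into the
    fixed-size sink via an exponential moving average.
    `alpha` is a hypothetical retention hyperparameter."""
    return alpha * sink + (1.0 - alpha) * evicted

# Toy streaming loop over a fixed-size window of frame tokens.
W, D = 4, 8                          # window size, token dim (illustrative)
rng = np.random.default_rng(0)
window = [rng.standard_normal(D) for _ in range(W)]
sink = window[0].copy()              # sink initialized from the initial frame

for step in range(10):               # new frames arrive over time
    new_tok = rng.standard_normal(D)
    evicted = window.pop(0)          # token leaving the sliding window
    sink = ema_sink_update(sink, evicted)  # long-term context at O(1) memory
    window.append(new_tok)
    # attention would attend over [sink] + window here
```

Because the sink tensor has a fixed size and the EMA is a single fused multiply-add per eviction, the update adds no meaningful compute, which matches the paper's claim of no additional cost.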
Abstract
Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding-window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains a fixed-size set of tokens initialized from the initial frames and continuously updated by fusing evicted tokens via an exponential moving average as they exit the sliding window. Without additional computational cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial-frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics, as rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. Quantitative and qualitative experiments show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
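The reward-weighted objective behind Re-DMD can be illustrated with a small sketch. The softmax weighting and temperature `tau` below are plausible assumptions for how VLM-rated dynamics scores could be turned into sample weights; the paper's exact weighting scheme may differ.

```python
import numpy as np

def redmd_weights(rewards, tau=1.0):
    """Turn VLM-rated dynamics scores into normalized sample weights.
    A temperature-scaled softmax is one plausible choice (assumption)."""
    z = np.asarray(rewards, dtype=float) / tau
    z -= z.max()                      # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def redmd_loss(per_sample_dmd, rewards, tau=1.0):
    """Reward-weighted distribution matching loss: samples rated as
    more dynamic contribute more to the gradient update, biasing the
    output distribution toward high-reward (high-motion) regions."""
    w = redmd_weights(rewards, tau)
    return float(np.dot(w, np.asarray(per_sample_dmd, dtype=float)))
```

In contrast, vanilla DMD corresponds to uniform weights (`w = 1/N`); lowering `tau` sharpens the distribution so that the most dynamic samples dominate training.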