🤖 AI Summary
Existing autoregressive video diffusion models suffer from temporal repetition, motion drift, and deceleration during long-sequence streaming generation; directly adapting StreamingLLM-style attention sinks further degrades fidelity and induces dynamic stagnation. This paper proposes Deep Forcing, a training-free method for ultra-long video extrapolation built on deep contextual stabilization and critical-information preservation. Its core innovation is the combination of Deep Sink (a sliding-window persistent sink-token mechanism) with Participative Compression (importance-aware KV pruning coupled with temporal RoPE phase realignment), which together ensure long-term temporal consistency and real-time inference without fine-tuning. Experiments demonstrate over 12× temporal extrapolation (e.g., 5 s → 60+ s), outperforming LongLive and RollingForcing in imaging quality, aesthetic score, and motion richness.
📝 Abstract
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that require no fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV-cache pruning that preserves only the tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution-length generation. Together, these components enable over 12× extrapolation (e.g., a model trained on 5-second clips generating 60+ seconds) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressive streaming long-video generation.
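The abstract describes a KV-cache policy: reserve half the sliding window for persistent sink tokens, keep only the most "participative" recent tokens by attention mass, and remap the retained positions onto a contiguous timeline for RoPE. The paper does not publish code here, so the following is only a minimal NumPy sketch of that policy under assumed shapes; the function names `manage_kv_cache` and `realign_rope_phase`, the `sink_frac` parameter, and the per-token attention-mass importance score are all illustrative assumptions, not the authors' API.

```python
import numpy as np

def manage_kv_cache(keys, values, attn_mass, window, sink_frac=0.5):
    """Hypothetical sketch of Deep-Forcing-style cache management.

    keys, values: (T, d) cached entries for T past tokens.
    attn_mass:    (T,) attention mass each cached token received recently
                  (assumed importance score for Participative Compression).
    window:       total KV budget; a sink_frac share is reserved for
                  persistent sink tokens (Deep Sink), the rest for the
                  most actively attended recent tokens.
    """
    T = keys.shape[0]
    if T <= window:
        return keys, values, np.arange(T)
    n_sink = int(window * sink_frac)   # oldest tokens act as persistent sinks
    n_keep = window - n_sink           # budget for participative recent tokens
    sink_idx = np.arange(n_sink)
    rest = np.arange(n_sink, T)
    # importance-aware pruning: keep tokens with the highest recent attention
    top = rest[np.argsort(attn_mass[rest])[::-1][:n_keep]]
    keep = np.concatenate([sink_idx, np.sort(top)])
    return keys[keep], values[keep], keep

def realign_rope_phase(positions, current_start=0):
    # Temporal RoPE re-alignment: collapse the retained (now gapped)
    # positions onto a contiguous, in-distribution timeline.
    ranks = np.argsort(np.argsort(positions))
    return current_start + ranks
```

The key design point the sketch mirrors is that pruning leaves gaps in the position sequence, so the retained tokens must be re-indexed contiguously before RoPE is applied, keeping positions inside the range the model saw at training time.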