🤖 AI Summary
This work addresses the significant degradation in visual quality that occurs when pretrained short-clip video diffusion models are used to generate long videos, a problem the authors trace to out-of-distribution (O.O.D.) shifts in frame-level relative positions and context length. To mitigate these shifts without additional training, they propose FreeLOC, a framework that identifies the Transformer layers most sensitive to each shift via a layer-adaptive probing mechanism and applies targeted corrections: Video-based Relative Position Re-encoding (VRPR) for the positional shift and Tiered Sparse Attention (TSA) for the context-length shift. Experiments show that FreeLOC substantially outperforms existing training-free approaches in both temporal consistency and visual fidelity, achieving state-of-the-art zero-shot long-video generation.
📝 Abstract
Generating long videos with pre-trained video diffusion models, which are typically trained on short clips, is a significant challenge: directly applying these models to long-video inference often causes notable degradation in visual quality. This paper identifies two out-of-distribution (O.O.D.) problems as the primary cause: frame-level relative position O.O.D. and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework built on two core techniques. Video-based Relative Position Re-encoding (VRPR) tackles frame-level relative position O.O.D. through a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution. Tiered Sparse Attention (TSA) tackles context-length O.O.D. by structuring attention density across different temporal scales, preserving both local detail and long-range dependencies. Crucially, we introduce a layer-adaptive probing mechanism that identifies each Transformer layer's sensitivity to these O.O.D. issues, allowing our methods to be applied selectively and efficiently. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.
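To make the two core ideas concrete, below is a minimal, hypothetical sketch of what hierarchical position re-encoding and tiered sparse attention masking could look like. The function names, the coarse/fine two-tier split, and the local-window-plus-strided-anchors mask are illustrative assumptions, not the authors' implementation; consult the linked repository for the actual FreeLOC code.

```python
import numpy as np

def reencode_positions(num_frames: int, trained_len: int) -> np.ndarray:
    """VRPR-style idea (illustrative): split each absolute frame index into
    a coarse tier (window id) and a fine tier (in-window offset), so that
    fine-grained relative distances never exceed the pretrained clip length."""
    idx = np.arange(num_frames)
    # Column 0: coarse window index; column 1: fine offset < trained_len.
    return np.stack([idx // trained_len, idx % trained_len], axis=1)

def tiered_sparse_mask(num_frames: int, local_window: int, stride: int) -> np.ndarray:
    """TSA-style idea (illustrative): dense attention inside a local temporal
    window, plus sparse strided attention to global anchor frames, keeping
    the effective context length close to the pretrained regime."""
    i = np.arange(num_frames)[:, None]
    j = np.arange(num_frames)[None, :]
    local = np.abs(i - j) <= local_window   # preserve local detail
    anchors = (j % stride) == 0             # sparse long-range dependencies
    return local | anchors

positions = reencode_positions(num_frames=12, trained_len=4)
mask = tiered_sparse_mask(num_frames=12, local_window=1, stride=4)
```

In this sketch, frame 5 of a 12-frame video with a 4-frame pretrained length is re-encoded as (window 1, offset 1), and each query frame attends densely to its immediate neighbors plus every fourth anchor frame.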