FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing training-free long video generation methods suffer from architectural dependencies, quality degradation, error accumulation, and repetitive motions. This work proposes a general inference-time framework that generates long videos via overlapping sliding windows and introduces Tweedie matching, which blends the predicted clean samples of adjacent windows within their overlapping regions to enforce both a manifold constraint and temporal consistency. Stochastic early-phase sampling injects fresh noise after each Tweedie matching correction in the high-noise phase to synchronize per-window trajectories; the method then switches to deterministic ODE sampling to preserve fine-grained details. Notably, it is the first to integrate Tweedie matching with manifold constraints for long video synthesis, requiring no retraining and remaining architecture-agnostic. The approach outperforms existing training-free and autoregressive baselines in both temporal coherence and visual fidelity, enabling video synthesis several times longer than the base model’s native window length, and extends to audio-video joint generation and text-to-3D Gaussian Splatting (3DGS) without fine-tuning.
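
For readers unfamiliar with the term, Tweedie's formula gives the posterior-mean estimate of the clean sample given a noisy one. The block below states it under an assumed Gaussian forward process with signal scale alpha_t and noise scale sigma_t. These symbols are illustrative and do not come from the paper, which may use a different parametrization; the "predicted clean samples" blended in overlap regions are presumably estimates of this kind.

```latex
% Assumed forward process (illustrative notation, not the paper's):
%   x_t = \alpha_t x_0 + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
% Tweedie's formula for the posterior mean of the clean sample:
\hat{x}_0(x_t) \;=\; \mathbb{E}[x_0 \mid x_t]
  \;=\; \frac{x_t + \sigma_t^{2}\, \nabla_{x_t} \log p_t(x_t)}{\alpha_t}
```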

📝 Abstract

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
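
The pipeline described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It makes several assumptions that the abstract does not confirm: a rectified-flow style interpolation x_t = (1 - t) * x0 + t * eps, a single shared long latent instead of separate per-window latents, and plain uniform averaging of predicted clean frames as a stand-in for the paper's Tweedie matching. The names `window_starts`, `blend_overlaps`, `sample_long`, `denoise`, and `switch` are hypothetical.

```python
import numpy as np

def window_starts(total, window, overlap):
    """Start indices of overlapping windows covering `total` frames.

    Assumes total >= window and 0 <= overlap < window.
    """
    stride = window - overlap
    starts = list(range(0, total - window + 1, stride))
    if starts[-1] + window < total:
        starts.append(total - window)  # last window sits flush with the end
    return starts

def blend_overlaps(preds, starts, total):
    """Average predicted clean frames wherever windows overlap.

    Uniform averaging is an assumption; the paper's Tweedie matching
    may weight or constrain the blend differently.
    """
    window = preds[0].shape[0]
    acc = np.zeros((total,) + preds[0].shape[1:])
    cnt = np.zeros((total,) + (1,) * (preds[0].ndim - 1))
    for p, s in zip(preds, starts):
        acc[s:s + window] += p
        cnt[s:s + window] += 1.0
    return acc / cnt

def sample_long(denoise, total, dim, window=16, overlap=4,
                steps=20, switch=0.5, seed=0):
    """Toy long-sequence sampler.

    `denoise(x_t, t)` is a hypothetical callable returning the predicted
    clean sample for one window of shape (window, dim).
    """
    rng = np.random.default_rng(seed)
    starts = window_starts(total, window, overlap)
    x = rng.standard_normal((total, dim))      # pure noise at t = 1
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        preds = [denoise(x[s:s + window], t) for s in starts]
        x0 = blend_overlaps(preds, starts, total)
        if t > switch:
            # High-noise phase: inject fresh noise after the blend.
            eps = rng.standard_normal(x.shape)
        else:
            # Low-noise phase: deterministic Euler step that reuses
            # the noise implied by the current state.
            eps = (x - (1.0 - t) * x0) / t
        x = (1.0 - t_next) * x0 + t_next * eps
    return x

# Toy usage: a placeholder denoiser that always predicts zeros.
video = sample_long(lambda x_t, t: np.zeros_like(x_t), total=40, dim=8)
```

With `window=16` and `overlap=4`, the 40-frame toy sequence is covered by three windows that share four frames with each neighbor. In a real setting `denoise` would wrap a pretrained video model, and `switch` would control where the stochastic phase hands over to ODE sampling.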

Problem

Research questions and friction points this paper is trying to address.

long video generation

video diffusion models

temporal consistency

inference-time method

manifold constraint

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tweedie matching

manifold constraint

temporal consistency