🤖 AI Summary
This work addresses the high computational cost of existing 3D attention–based video diffusion models in generating ultra-high-definition videos, as well as the performance degradation caused by the image–video modality gap when such models are directly fine-tuned on high-resolution images. To overcome these challenges, the authors propose a purely image-driven adaptation framework that decouples modality alignment from spatial extrapolation. The approach employs a two-stage Relay LoRA fine-tuning strategy and introduces a high-frequency-aware reconstruction loss. Notably, the method requires no high-resolution video training data, yet it outperforms state-of-the-art video-trained models by 0.8 points on VBench, marking the first successful generation of detail-rich, high-fidelity ultra-high-definition videos using only image data.
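The two-stage Relay LoRA strategy described above can be sketched with low-rank weight updates. This is a minimal illustration, not the authors' implementation: the dimensions, rank, and the stand-in "training" are hypothetical, and only the weight-merging logic (both adapters active in stage 2, only the stage-2 adapter kept at inference) follows the description.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hypothetical hidden size and LoRA rank

# Frozen pre-trained weight of the video Diffusion Transformer (stand-in)
W = rng.normal(size=(d, d))

# Stage 1: LoRA adapter trained on LOW-resolution images to bridge the
# image-video modality gap. Training is elided; random values stand in
# for learned parameters.
A1 = rng.normal(size=(r, d))
B1 = rng.normal(size=(d, r)) * 0.01

# Stage 2: a SECOND LoRA adapter trained on HIGH-resolution images for
# spatial extrapolation, with the stage-1 adapter still attached.
A2 = rng.normal(size=(r, d))
B2 = rng.normal(size=(d, r)) * 0.01
W_stage2_train = W + B1 @ A1 + B2 @ A2  # both adapters active during training

# Inference: drop the stage-1 (image-modality) adapter so the video
# generation modality is preserved; keep only the high-resolution adapter.
W_inference = W + B2 @ A2
```

The key design point is that the stage-2 adapter is learned *relative to* the stage-1-adapted model, so removing the stage-1 adapter at inference reverts the modality shift while retaining the spatial extrapolation update.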
📝 Abstract
Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a purely image-based adaptation framework that upgrades a video Diffusion Transformer, pre-trained at its native scale, to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to handle modality alignment and spatial extrapolation separately. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adapter is retained, preserving the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Aware Training Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 points on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.
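The abstract's high-frequency-aware reconstruction loss can be illustrated with a simple frequency-domain decomposition. The sketch below is an assumption about the general shape of such an objective, not the paper's actual loss: the cutoff radius, the weighting `lam`, and the 2D-FFT high-pass filter are all hypothetical choices standing in for whatever the authors use on latent representations.

```python
import numpy as np

def high_pass(x, cutoff=0.25):
    """Zero out FFT coefficients whose normalized radius is below `cutoff`,
    keeping only the high-frequency content of a 2D array."""
    F = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    F[radius < cutoff] = 0.0  # suppress low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

def hf_aware_loss(pred, target, lam=0.5):
    """Base reconstruction loss plus an extra penalty (hypothetical weight
    `lam`) on the mismatch of high-frequency components only."""
    base = np.mean((pred - target) ** 2)
    hf = np.mean((high_pass(pred) - high_pass(target)) ** 2)
    return base + lam * hf
```

The extra term leaves a pair of identical inputs at zero loss but penalizes predictions whose fine detail (edges, texture) deviates from the target more heavily than a plain MSE would, which is the stated intent of encouraging recovery of high-frequency components.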