ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of existing 3D attention–based video diffusion models in generating ultra-high-definition videos and the performance degradation caused by modality gaps when directly fine-tuning on high-resolution images. To overcome these challenges, the authors propose a purely image-driven adaptation framework that decouples modality alignment from spatial extrapolation. The approach employs a two-stage Relay LoRA fine-tuning strategy and introduces a high-frequency-aware reconstruction loss. Notably, the method requires no high-resolution video training data yet outperforms state-of-the-art video-trained models by 0.8 points on VBench, marking the first successful generation of detail-rich, high-fidelity ultra-high-definition videos using only image data.
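The two-stage Relay LoRA strategy described above can be sketched in miniature. This is an illustrative sketch, not the paper's implementation: the class and adapter names are assumptions, and scalar weights stand in for the real low-rank adapter matrices. The key point it demonstrates is that both adapters are active while stage 2 trains on top of stage 1, but only the stage-2 (spatial extrapolation) adapter is kept at inference, so the base model's video modality is preserved.

```python
# Hedged sketch of the Relay LoRA idea. All names are illustrative;
# scalars stand in for low-rank weight deltas (B @ A in real LoRA).

class LoRALinear:
    def __init__(self, weight):
        self.weight = weight   # frozen base weight (scalar for brevity)
        self.adapters = {}     # adapter name -> weight delta

    def add_adapter(self, name, delta=0.0):
        self.adapters[name] = delta

    def forward(self, x, active):
        # Sum only the deltas of the currently active adapters.
        delta = sum(self.adapters[n] for n in active if n in self.adapters)
        return (self.weight + delta) * x

layer = LoRALinear(weight=1.0)
layer.add_adapter("stage1_align", delta=0.2)   # trained on low-res images
layer.add_adapter("stage2_extrap", delta=0.3)  # trained on high-res images

# Stage-2 training: stage 1 "relays" into stage 2, both adapters active.
train_out = layer.forward(1.0, active=["stage1_align", "stage2_extrap"])

# Inference: drop the alignment adapter, retain only spatial extrapolation.
infer_out = layer.forward(1.0, active=["stage2_extrap"])
```

Retaining only the second adapter is the crux: the stage-1 adapter exists solely to bridge the image-video modality gap during training, so carrying it into inference would pull the model toward the image domain.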

📝 Abstract
Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Aware Training Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 points on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.
Problem

Research questions and friction points this paper is trying to address.

video synthesis
ultra-high-resolution
diffusion models
modality gap
image-to-video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relay LoRA
modality alignment
spatial extrapolation
high-frequency-aware training
image-to-video synthesis
Yunfeng Wu
School of Artificial Intelligence, Shanghai Jiao Tong University; Xi’an Jiaotong-Liverpool University
Hongying Cheng
School of Artificial Intelligence, Shanghai Jiao Tong University; Jilin University
Zihao He
Shanghai Jiao Tong University
Robot Learning
Songhua Liu
Shanghai Jiao Tong University
Computer Vision, Machine Learning