ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses controllable video generation and editing, which are hindered by the scarcity of paired video data and the high cost of training video diffusion models. The authors propose a tuning framework that requires no video training data at all, adapting a video diffusion Transformer for high-quality, diverse video generation and editing with only a few 2D images. The key innovation is a spatial-temporal decoupled reparameterization of the architecture: by disentangling a spatially independent attention path from the full 3D attention, it preserves visual fidelity and temporal consistency with negligible parameter overhead. In addition, a dual-path pipeline with timestep-specific noise scheduling enables flexible adaptation to multiple conditioning signals.
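
To make the reparameterization idea concrete, here is a minimal PyTorch sketch of one way a spatially independent attention path could sit beside a frozen full-3D attention. Everything here is an assumption for illustration, not the authors' implementation: the class and parameter names (`DecoupledSpatialAttention`, `delta_down`, `delta_up`), the low-rank form of the trainable delta, and the additive fusion of the two paths.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DecoupledSpatialAttention(nn.Module):
    """Hypothetical sketch: a frozen full-3D attention path plus a
    spatially independent per-frame path whose low-rank delta is the
    only trainable part, so it can be tuned on 2D images alone."""

    def __init__(self, dim: int, num_heads: int = 8, rank: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Pretrained 3D-attention projections, kept frozen.
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        for p in list(self.qkv.parameters()) + list(self.proj.parameters()):
            p.requires_grad = False
        # Trainable low-rank delta: the "minimal additional parameters".
        self.delta_down = nn.Linear(dim, rank, bias=False)
        self.delta_up = nn.Linear(rank, dim * 3, bias=False)
        nn.init.zeros_(self.delta_up.weight)  # spatial path starts as a no-op

    def _attend(self, qkv: torch.Tensor, n: int) -> torch.Tensor:
        # Standard multi-head attention over sequences of length n.
        q, k, v = (t.reshape(-1, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(-1, n, self.num_heads * self.head_dim)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch, num_frames * tokens_per_frame, dim), frame-major order.
        b, n, d = x.shape
        s = n // num_frames  # spatial tokens per frame
        qkv = self.qkv(x)
        # Path 1: frozen full 3D attention across all space-time tokens,
        # preserving the pretrained model's temporal consistency.
        full3d = self._attend(qkv, n)
        # Path 2: attention restricted to one frame at a time, carrying the
        # image-tuned delta -- spatial independence decoupled from 3D attention.
        qkv_spatial = qkv + self.delta_up(self.delta_down(x))
        spatial = self._attend(qkv_spatial.reshape(b * num_frames, s, 3 * d), s)
        return self.proj(full3d + spatial.reshape(b, n, d))
```

Zero-initializing `delta_up` makes the spatial path start as a no-op, so image-only tuning can adjust per-frame appearance without disturbing the pretrained space-time behavior, which is consistent with the summary's claim of negligible parameter overhead.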

📝 Abstract
Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, we propose a video-free tuning framework, termed ViFeEdit, for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, enabling visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results on controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.
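
The abstract also highlights a dual-path pipeline with separate timestep embeddings for noise scheduling. The sketch below shows one plausible wiring of such embeddings; the module name, the two small MLPs, and the convention of pinning the conditioning path to timestep 0 are hypothetical choices, not details confirmed by the paper.

```python
import math
import torch
from torch import nn

class DualPathTimestepEmbed(nn.Module):
    """Hypothetical sketch: one timestep embedding per path, so the
    denoising branch and the conditioning branch can follow different
    noise schedules inside the same model."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.dim = dim
        self.denoise_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.condition_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def _sinusoidal(self, t: torch.Tensor) -> torch.Tensor:
        # Classic sinusoidal timestep features, as used in diffusion models.
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0)
                          * torch.arange(half, dtype=torch.float32, device=t.device) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    def forward(self, t_denoise: torch.Tensor, t_condition: torch.Tensor):
        # Denoising tokens track the sampler's timestep; conditioning tokens
        # (e.g. a clean reference image) can be pinned to t = 0 instead.
        return (self.denoise_mlp(self._sinusoidal(t_denoise)),
                self.condition_mlp(self._sinusoidal(t_condition)))

# Example: a mid-schedule denoising step paired with a clean condition.
embed = DualPathTimestepEmbed(dim=256)
e_denoise, e_condition = embed(torch.tensor([500]), torch.zeros(1))
```

With two independent embeddings, the denoising branch can follow the sampler's schedule while conditioning tokens advertise their own noise level, which is one way a single model could adapt to diverse conditioning signals.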
Problem

Research questions and friction points this paper is trying to address.

video diffusion transformer
controllable video generation
video editing
paired video data scarcity
high computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-free tuning
diffusion transformer
architectural reparameterization
temporal consistency
controllable video editing