ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models face a critical memory bottleneck during pixel-space fine-tuning: their full-sequence recursive processing causes GPU memory consumption to grow linearly with video length, hindering scalability to long-duration or high-resolution videos. To address this, the paper proposes ChopGrad, a method that truncates backpropagation by confining gradient computation to local frame windows. This approach maintains global temporal consistency while reducing memory overhead to a constant level, independent of video length. ChopGrad thereby enables efficient pixel-level fine-tuning of video diffusion models, overcoming the traditional memory barrier. The method achieves state-of-the-art performance across diverse tasks, including video super-resolution, inpainting, neural rendering enhancement, and controllable driving video generation.

📝 Abstract
Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
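The abstract describes limiting gradient computation to a local window of frames while the forward recurrence still runs over the whole sequence, so activation storage no longer grows with video length. A minimal, self-contained sketch of this idea, using a toy scalar recurrence in place of the frame decoder (all names and the model itself are illustrative, not the paper's actual implementation):

```python
# Toy truncated backpropagation through a recurrent frame chain.
# Model: h_t = a * h_{t-1} + x_t, with frame-wise loss sum_t (h_t - y_t)^2.
# Full backprop differentiates through all t ancestors of each frame;
# truncating to a window of W frames treats states older than W steps
# as constants, which is what keeps training memory independent of T.

def forward(a, xs):
    """Run the recurrence over all frames (cheap; no gradient state kept)."""
    hs, h = [], 0.0
    for x in xs:
        h = a * h + x
        hs.append(h)
    return hs

def grad_a(a, xs, ys, window=None):
    """d/da of sum_t (h_t - y_t)^2, backpropagating through at most
    `window` recurrence steps per frame (window=None -> full backprop)."""
    hs = forward(a, xs)
    g = 0.0
    for t, (h_t, y_t) in enumerate(zip(hs, ys)):
        dL = 2.0 * (h_t - y_t)
        # dh_t/da = sum over ancestors k of a^(t-k) * h_{k-1} (h_{-1} = 0);
        # truncation keeps only the last `window` ancestors.
        start = 0 if window is None else max(0, t - window + 1)
        dh = 0.0
        for k in range(start, t + 1):
            h_prev = hs[k - 1] if k > 0 else 0.0
            dh += (a ** (t - k)) * h_prev
        g += dL * dh
    return g

xs = [1.0, 0.5, -0.2, 0.8]
ys = [1.0, 1.0, 1.0, 1.0]
full = grad_a(0.9, xs, ys)            # memory/compute grows with sequence length
trunc = grad_a(0.9, xs, ys, window=2)  # local-window approximation of the gradient
```

In an actual deep-learning framework the same effect is typically obtained by detaching the recurrent state at the window boundary (e.g. `h = h.detach()` in PyTorch) so that autograd never stores activations older than the window.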
Problem

Research questions and friction points this paper is trying to address.

video diffusion
pixel-wise losses
memory cost
fine-tuning
backpropagation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChopGrad
truncated backpropagation
video diffusion models
pixel-wise losses
constant memory training