NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of cross-frame attention and the limited quality of implicit video generation models, this paper proposes NeRV-Diffusion: it encodes an entire video into a unified implicit neural representation parameterized by network weights, and performs diffusion denoising directly in that weight space with an implicit diffusion transformer, thereby eliminating temporal attention entirely. Methodologically, the paper introduces a hypernetwork-based tokenizer that maps videos into the weight space, SNR-adaptive loss weighting and scheduled sampling for training the implicit diffusion model, and a reformed NeRV architecture (weight assignment, upsampling connections, and input coordinates) so that the generated weights are both Gaussian-distributed and expressive. Evaluated on UCF-101 and Kinetics-600, NeRV-Diffusion significantly outperforms existing implicit video generation methods and is competitive, on metrics including FVD and LPIPS, with state-of-the-art non-implicit diffusion models. It is the first implicit method to enable high-fidelity, end-to-end video generation together with smooth, arbitrary-frame-rate interpolation.
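The summary mentions SNR-adaptive loss weighting for diffusion training. The paper's exact scheme is not spelled out here; as an illustration of the general idea, the sketch below shows one common SNR-based variant (min-SNR-γ style reweighting), in which per-timestep losses are scaled by a function of the signal-to-noise ratio so that easy, high-SNR timesteps do not dominate training. The schedule and γ value are assumptions, not the paper's settings.

```python
import numpy as np

# Hedged sketch of SNR-based loss reweighting for diffusion training.
# This is the common min-SNR-gamma variant, shown only to illustrate the
# idea; it is not claimed to be NeRV-Diffusion's exact scheme.

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # standard linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal retention
snr = alphas_bar / (1.0 - alphas_bar)    # SNR(t) per timestep

def loss_weight(t: int, gamma: float = 5.0) -> float:
    """min(SNR, gamma) / SNR: down-weights very high-SNR (early) steps."""
    return min(snr[t], gamma) / snr[t]

print(round(loss_weight(0), 4))    # early step, huge SNR -> tiny weight
print(round(loss_weight(999), 4))  # late step, SNR < gamma -> weight 1.0
```

With this weighting, the training loss at each sampled timestep is simply multiplied by `loss_weight(t)` before backpropagation.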

📝 Abstract
We present NeRV-Diffusion, an implicit latent video diffusion model that synthesizes videos by generating neural network weights. The generated weights can be rearranged into the parameters of a convolutional neural network, which forms an implicit neural representation (INR) and decodes into videos with frame indices as the input. Our framework consists of two stages: 1) a hypernetwork-based tokenizer that encodes raw videos from pixel space to neural parameter space, where the bottleneck latent serves as the INR weights for decoding; 2) an implicit diffusion transformer that denoises the latent INR weights. In contrast to traditional video tokenizers that encode videos into frame-wise feature maps, NeRV-Diffusion compresses and generates a video holistically as a unified neural network. This enables efficient and high-quality video synthesis by obviating temporal cross-frame attention in the denoiser and decoding the video latent with dedicated decoders. To achieve Gaussian-distributed INR weights with high expressiveness, we reuse the bottleneck latent across all NeRV layers and reform its weight assignment, upsampling connections, and input coordinates. We also introduce SNR-adaptive loss weighting and scheduled sampling for effective training of the implicit diffusion model. NeRV-Diffusion reaches superior video generation quality over previous INR-based models and comparable performance to recent state-of-the-art non-implicit models on real-world video benchmarks including UCF-101 and Kinetics-600. It also yields a smooth INR weight space that facilitates seamless interpolation between frames or videos.
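The core mechanism, decoding a video from generated weights with the frame index as input, can be sketched in miniature. The toy below uses a tiny MLP instead of the paper's convolutional NeRV decoder, and random weights standing in for a denoised latent; all sizes and names are illustrative assumptions. It shows why arbitrary-frame-rate interpolation comes for free: any real-valued index t yields a frame.

```python
import numpy as np

# Minimal sketch (not the paper's architecture): a tiny MLP "INR" whose
# parameters come from one flat weight vector, as a weight-space diffusion
# model would generate them. It maps a frame index t to a frame, so a
# fractional t decodes an in-between frame at no extra cost.

HIDDEN, H, W = 16, 4, 4  # toy sizes; real NeRV decodes full frames via convs
N_PARAMS = (1 * HIDDEN + HIDDEN) + (HIDDEN * H * W + H * W)

def decode_frame(flat_weights: np.ndarray, t: float) -> np.ndarray:
    """Rearrange a flat weight vector into MLP layers and evaluate at index t."""
    i = 0
    w1 = flat_weights[i:i + HIDDEN].reshape(1, HIDDEN); i += HIDDEN
    b1 = flat_weights[i:i + HIDDEN]; i += HIDDEN
    w2 = flat_weights[i:i + HIDDEN * H * W].reshape(HIDDEN, H * W); i += HIDDEN * H * W
    b2 = flat_weights[i:i + H * W]
    hidden = np.tanh(np.array([[t]]) @ w1 + b1)  # frame index as input coordinate
    return (hidden @ w2 + b2).reshape(H, W)

rng = np.random.default_rng(0)
weights = rng.normal(size=N_PARAMS)   # stand-in for a denoised weight latent
f0 = decode_frame(weights, 0.0)
f_half = decode_frame(weights, 0.5)   # "in-between" frame via fractional index
f1 = decode_frame(weights, 1.0)
print(f0.shape, f_half.shape)         # (4, 4) (4, 4)
```

In the actual model the flat vector is the bottleneck latent produced (or denoised) in stage 2, and it is reused across all NeRV layers rather than sliced once as above.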
Problem

Research questions and friction points this paper is trying to address.

Synthesizing videos via neural network weight generation
Compressing videos holistically as unified neural representations
Enabling efficient video synthesis without temporal cross-frame attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates videos as neural network weights via a hypernetwork-based tokenizer
Denoises directly in the INR weight space with an implicit diffusion transformer, removing temporal cross-frame attention
Trains the implicit diffusion model with SNR-adaptive loss weighting and scheduled sampling