RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Decoders in existing video generation models are typically unconditional, which leads to loss of detail and inconsistency relative to the input image. This work proposes RefDecoder, a conditional video VAE decoder that injects a high-fidelity reference image directly into the decoding process. At each upsampling stage, reference image features are processed together with the denoised video latents via a reference attention mechanism. The decoder can be swapped into existing video generation systems without additional fine-tuning. The method achieves up to a 2.1 dB PSNR improvement over unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks, and improves subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. The approach also generalizes to other visual generation tasks, such as style transfer and video editing refinement.
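To put the reported PSNR gain in context, the sketch below computes PSNR from its standard definition and converts a dB gain into the equivalent reduction in mean squared error. The function names `psnr` and `mse_ratio_for_gain` are illustrative and not taken from the paper, and the pixel range of [0, 1] is an assumption.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB (assumes pixel values in [0, max_value])."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs: PSNR is unbounded
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR gain of gain_db."""
    return 10.0 ** (-gain_db / 10.0)

# A +2.1 dB gain corresponds to roughly a 38% lower mean squared error.
ratio = mse_ratio_for_gain(2.1)
print(f"MSE ratio for +2.1 dB: {ratio:.3f}")
```

Because PSNR is logarithmic in the error, a fixed dB gain always implies the same relative MSE reduction, whatever the baseline quality.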
📝 Abstract
Video generation powers a vast array of downstream applications. However, while latent diffusion models, the de facto standard, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
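The abstract does not spell out how reference attention is computed, so the following is only a minimal sketch of one plausible reading: video latent tokens act as queries and attend jointly over themselves and the reference tokens. The single-head layout, the residual connection, the token shapes, and the names `reference_attention`, `w_q`, `w_k`, and `w_v` are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head attention in which video tokens query video + reference tokens.

    video_tokens: (n_video, d) denoised video latent tokens at one decoder stage
    ref_tokens:   (n_ref, d) tokens from a (hypothetical) reference image encoder
    w_q, w_k, w_v: (d, d) projection matrices
    """
    # Assumed form of "co-processing": keys and values cover both token sets.
    context = np.concatenate([video_tokens, ref_tokens], axis=0)
    q = video_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)      # (n_video, n_video + n_ref)
    out = video_tokens + weights @ v        # residual connection (assumed)
    return out, weights

# Toy usage with random tokens and projections.
rng = np.random.default_rng(0)
d, n_video, n_ref = 16, 12, 5
video = rng.standard_normal((n_video, d))
ref = rng.standard_normal((n_ref, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, weights = reference_attention(video, ref, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

In this reading the output keeps the shape of the video tokens, so a block like this could sit between existing decoder layers without changing their interfaces, which is consistent with the abstract's claim that the decoder can be swapped into existing systems.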
Problem

Research questions and friction points this paper is trying to address.

video generation
conditional decoding
detail preservation
structural consistency
VAE decoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

conditional video decoding
reference attention
VAE decoder
latent diffusion models
video generation