Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

📅 2025-05-23
🤖 AI Summary
In video diffusion models, the initial noise seed strongly influences both generation quality and prompt alignment; however, existing methods rely on externally designed priors and overlook internal model signals that indicate which seeds are preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a framework that selects noise seeds using the model's own attention. Its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples, together with a Bernoulli-masked approximation that estimates the score from a single diffusion step and a subset of attention layers. Evaluated on CogVideoX-2B and CogVideoX-5B, ANSE improves video quality and temporal coherence while adding only 8% and 13% to inference time, respectively. ANSE is plug-and-play and requires no architectural modification or retraining.

📝 Abstract
The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: https://anse-project.github.io/anse-project/
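
As a rough illustration of the "entropy disagreement" idea in the abstract, the sketch below scores one candidate noise seed from several stochastic attention samples. It assumes a BALD-style form (entropy of the averaged attention minus the average of the per-sample entropies) and assumes that the lowest-scoring seed is the one to keep; neither detail is stated on this page, and the function names `shannon_entropy`, `entropy_disagreement`, and `select_seed` are hypothetical.

```python
import math


def shannon_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of one discrete distribution."""
    return -sum(p * math.log(max(p, eps)) for p in probs)


def entropy_disagreement(attn_samples):
    """Score K stochastic attention samples for a single noise seed.

    attn_samples: list of K attention maps; each map is a list of rows,
    and each row is a probability distribution over keys.
    Assumed form (not confirmed by the source):
        H(mean over samples) - mean over samples of H(sample),
    averaged over rows. It is 0 when all samples agree.
    """
    k = len(attn_samples)
    n_rows = len(attn_samples[0])
    total = 0.0
    for r in range(n_rows):
        rows = [sample[r] for sample in attn_samples]
        mean_row = [sum(col) / k for col in zip(*rows)]
        entropy_of_mean = shannon_entropy(mean_row)
        mean_of_entropies = sum(shannon_entropy(row) for row in rows) / k
        total += entropy_of_mean - mean_of_entropies
    return total / n_rows


def select_seed(scores):
    """Index of the lowest score (assumption: low disagreement is preferred)."""
    return min(range(len(scores)), key=scores.__getitem__)


# Toy data: two samples that agree, and two that disagree completely.
agree = [[[0.9, 0.1]], [[0.9, 0.1]]]
disagree = [[[1.0, 0.0]], [[0.0, 1.0]]]
scores = [entropy_disagreement(disagree), entropy_disagreement(agree)]
best = select_seed(scores)
```

In this toy case the agreeing samples score zero and the fully disagreeing ones score ln 2, so the second candidate is selected. In a real pipeline the attention maps would come from the video diffusion model itself rather than from hand-written lists.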
Problem

Research questions and friction points this paper is trying to address.

Initial noise choice impacts video diffusion model quality
Existing methods ignore internal model signals for noise preference
How to select high-quality noise seeds using attention-based uncertainty (the ANSE approach)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian Active Noise Selection via Attention
Bernoulli-masked approximation for efficient inference
Quantifies attention-based uncertainty for noise selection
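
One way to picture the Bernoulli-masked approximation listed above is to draw several stochastic attention distributions from a single set of attention logits by randomly masking keys, instead of running the model several times. The sketch below is only an illustration under assumptions: the page does not say where the mask is applied or what keep probability is used, so masking keys before the softmax, the `keep_prob` value, and the function names are all hypothetical.

```python
import math
import random


def bernoulli_masked_attention(logits, keep_prob, rng):
    """Softmax over a random subset of keys.

    Each key is kept independently with probability keep_prob (assumed
    masking scheme). Dropped keys get zero attention weight.
    """
    mask = [rng.random() < keep_prob for _ in logits]
    if not any(mask):
        # Keep at least one key so the distribution stays well defined.
        mask[rng.randrange(len(logits))] = True
    top = max(l for l, keep in zip(logits, mask) if keep)
    exps = [math.exp(l - top) if keep else 0.0 for l, keep in zip(logits, mask)]
    norm = sum(exps)
    return [e / norm for e in exps]


def sample_attention_rows(logits, num_samples, keep_prob=0.8, seed=0):
    """Draw num_samples stochastic attention rows from one row of logits."""
    rng = random.Random(seed)
    return [
        bernoulli_masked_attention(logits, keep_prob, rng)
        for _ in range(num_samples)
    ]


# Toy logits for one query over four keys.
logits = [2.0, 1.0, 0.5, -1.0]
samples = sample_attention_rows(logits, num_samples=5, keep_prob=0.8, seed=0)
unmasked = sample_attention_rows(logits, num_samples=2, keep_prob=1.0, seed=0)
```

Each sampled row is still a valid probability distribution, and with `keep_prob=1.0` every sample reduces to the ordinary softmax. The spread among the masked samples is what an entropy-based uncertainty score would then measure.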