🤖 AI Summary
Conventional diffusion models rely on isotropic Gaussian noise as a fixed, unstructured initial prior, limiting semantic expressiveness and external controllability. Method: We propose NoiseAR—the first framework to model the initial noise as a learnable, conditional, autoregressive distribution. It generates noise parameters token- or patch-wise, conditioned on textual prompts in an iterative, sequential manner. Contribution/Results: NoiseAR enables fine-grained semantic control over the diffusion starting point and natively integrates with probabilistic optimization frameworks such as reinforcement learning. Extensive experiments demonstrate that NoiseAR significantly improves both generation quality and text–image alignment across multiple benchmarks, outperforming standard fixed Gaussian initialization and existing controllable initialization methods.
📝 Abstract
Diffusion models have emerged as powerful generative frameworks, creating data samples by progressively denoising an initial random state. Traditionally, this initial state is sampled from a simple, fixed distribution like isotropic Gaussian, inherently lacking structure and a direct mechanism for external control. While recent efforts have explored ways to introduce controllability into the diffusion process, particularly at the initialization stage, they often rely on deterministic or heuristic approaches. These methods can be suboptimal, lack expressiveness, and are difficult to scale or integrate into more sophisticated optimization frameworks. In this paper, we introduce NoiseAR, a novel method for AutoRegressive Initial Noise Prior for Diffusion Models. Instead of a static, unstructured source, NoiseAR learns to generate a dynamic and controllable prior distribution for the initial noise. We formulate the generation of the initial noise prior's parameters as an autoregressive probabilistic modeling task over spatial patches or tokens. This approach enables NoiseAR to capture complex spatial dependencies and introduce learned structure into the initial state. Crucially, NoiseAR is designed to be conditional, allowing text prompts to directly influence the learned prior, thereby achieving fine-grained control over the diffusion initialization. Our experiments demonstrate that NoiseAR can generate initial noise priors that lead to improved sample quality and enhanced consistency with conditional inputs, offering a powerful, learned alternative to traditional random initialization. A key advantage of NoiseAR is its probabilistic formulation, which naturally supports seamless integration into probabilistic frameworks like Markov Decision Processes and Reinforcement Learning. Our code will be available at https://github.com/HKUST-SAIL/NoiseAR/