🤖 AI Summary
Contemporary image generation models achieve high visual fidelity but rely on complex parametric architectures and extensive training, resulting in opaque generative mechanisms. Method: This paper proposes a zero-shot, fully non-parametric generative framework that exploits inherent properties of natural images—spatial non-stationarity, low-level regularity, and high-level semantic structure—defining pixel-wise conditional distributions solely via local contextual windows, enabling interpretable, pixel-level sampling without optimization. Contribution/Results: The approach uncovers a “part-to-whole” generalization principle, offering a minimal theoretical account of natural image structure. It generates visually realistic samples on MNIST and CIFAR-10, with fully traceable and reproducible inference. Crucially, it achieves, for the first time, simultaneous high-fidelity synthesis and mechanistic interpretability—bridging generative performance with analytical transparency—and substantially advances the explainability and theoretical tractability of generative models.
📝 Abstract
Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms remain opaque. Rather than pushing scaling further, our goal is to strip away complicated engineering tricks and propose a simple, non-parametric generative model. Our design is grounded in three principles of natural images: (i) spatial non-stationarity, (ii) low-level regularities, and (iii) high-level semantics. It defines each pixel's distribution from its local context window. Despite its minimal architecture and the absence of any training, the model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images. This combination of simplicity and strong empirical performance points toward a minimal theory of natural-image structure. The model's white-box nature also gives us a mechanistic understanding of how it generalizes and generates diverse images: we study it by tracing each generated pixel back to its source images. These analyses reveal a simple, compositional procedure for "part-whole generalization", suggesting a hypothesis for how large neural network generative models learn to generalize.
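The pixel-wise sampling idea described above can be sketched in a few lines of NumPy. The following is an illustrative reconstruction under our own assumptions, not the paper's exact procedure: pixels are generated in raster order, each pixel's causal context window (already-generated neighbors within a radius `k`) is matched against the same window location in every dataset image, and the pixel value is drawn from a context-similarity-weighted distribution. The function name `sample_image` and the parameters `k` and `temp` are hypothetical choices for this sketch.

```python
import numpy as np

def sample_image(dataset, k=2, temp=0.1, rng=None):
    """Non-parametric pixel-by-pixel sampling (illustrative sketch).

    dataset: float array of shape (N, H, W), values in [0, 1].
    k: radius of the local context window.
    temp: softmax temperature for context-similarity weighting (assumed).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, h, w = dataset.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # Bounds of the local window around (i, j).
            i0, i1 = max(0, i - k), min(h, i + k + 1)
            j0, j1 = max(0, j - k), min(w, j + k + 1)
            # Causal mask: window positions generated before (i, j)
            # in raster order.
            mask = np.zeros((i1 - i0, j1 - j0), dtype=bool)
            for a in range(i0, i1):
                for b in range(j0, j1):
                    if (a, b) < (i, j):
                        mask[a - i0, b - j0] = True
            if not mask.any():
                # First pixel: no context yet, copy from a random image.
                out[i, j] = dataset[rng.integers(n), i, j]
                continue
            ctx = out[i0:i1, j0:j1][mask]                 # generated context
            cand = dataset[:, i0:i1, j0:j1][:, mask]      # (N, |context|)
            dist = ((cand - ctx) ** 2).mean(axis=1)       # context mismatch
            weights = np.exp(-dist / temp)
            weights /= weights.sum()
            src = rng.choice(n, p=weights)                # traceable source
            out[i, j] = dataset[src, i, j]
    return out
```

Because each pixel is copied from an identifiable `src` image, every generated pixel can be traced back to its source, which is the white-box property the abstract highlights.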