🤖 AI Summary
This work addresses the "black-box" nature of convolutional neural networks (CNNs) in solving imaging inverse problems by proposing LE-MMSE, an analytically tractable theoretical framework that explicitly incorporates CNN inductive biases. Built upon minimum mean square error (MMSE) estimation, LE-MMSE formally integrates translation equivariance and local receptive-field constraints to yield an interpretable, solvable model of the trained network. Theoretical analysis elucidates the fundamental distinction between physics-aware and physics-agnostic estimators and clarifies the role of high-density regions in the training distribution. Extensive experiments across diverse inverse problems, datasets, and mainstream architectures—including U-Net, ResNet, and PatchMLP—show close agreement between theoretical predictions and actual CNN outputs (PSNR consistently above 25 dB), supporting the effectiveness and broad applicability of the LE-MMSE framework.
📝 Abstract
Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and are often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR $\gtrsim 25$ dB). Furthermore, we provide insights into the differences between \emph{physics-aware} and \emph{physics-agnostic} estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).
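To make the "analytic formula under the empirical training distribution" concrete, here is a minimal sketch of the standard *unconstrained* empirical MMSE estimator for Gaussian denoising ($y = x + n$, $n \sim \mathcal{N}(0, \sigma^2 I)$): with a uniform prior over the training set, the posterior mean becomes a softmax-weighted average of training images. This is a textbook identity, not the paper's LE-MMSE, which additionally imposes equivariance and locality (e.g., by working over the training *patch* distribution); the function name is illustrative.

```python
import numpy as np

def empirical_mmse_denoiser(y, train_images, sigma):
    """Posterior mean E[x | y] for y = x + N(0, sigma^2 I), with the prior
    taken as the uniform empirical distribution over `train_images`.

    The estimate is a softmax-weighted convex combination of the training
    images, weighted by how well each explains the noisy observation y.
    """
    # Squared residuals ||y - x_i||^2 for each candidate clean image x_i
    d2 = np.array([np.sum((y - x) ** 2) for x in train_images])
    # Log posterior weights (up to a constant); subtract max for stability
    logw = -d2 / (2.0 * sigma ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Posterior mean: convex combination of training images
    return np.tensordot(w, np.stack(train_images), axes=1)
```

As $\sigma \to 0$ the estimate collapses onto the nearest training image, illustrating the role of high-density regions in the training distribution; LE-MMSE replaces the whole-image sum with a constrained, patch-local computation to mirror the CNN's inductive biases.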