🤖 AI Summary
This work investigates whether the mean-field perspective on neural networks can be inverted: rather than viewing wide networks as approximations of dynamics over probability measures, the authors implement such dynamics directly. To this end, they propose Gaussian mixture (GM) layers: neural modules that embed learnable Gaussian mixture distributions into the network architecture and define differentiable distributional dynamics via Wasserstein gradient flows derived from mean-field theory. Unlike conventional layers, whose weights are updated individually, GM layers optimize over a parametric family of probability distributions themselves and exhibit dynamical behavior distinct from that of fully connected layers, even wide ones operating in the mean-field regime. Empirically, a single GM layer achieves classification accuracy comparable to a two-layer fully connected network on simple benchmarks, validating the proposal as a proof of concept. This approach establishes a measure-theoretic paradigm for neural network design, shifting the focus from weight optimization to distributional evolution in Wasserstein space.
📝 Abstract
The mean-field theory for two-layer neural networks considers infinitely wide networks that are linearly parameterized by a probability measure over the parameter space. This nonparametric perspective has significantly advanced both the theoretical and conceptual understanding of neural networks, with substantial efforts made to validate its applicability to networks of moderate width. In this work, we explore the opposite direction, investigating whether dynamics can be directly implemented over probability measures. Specifically, we employ Gaussian mixture models as a flexible and expressive parametric family of distributions together with the theory of Wasserstein gradient flows to derive training dynamics for such measures. Our approach introduces a new type of layer -- the Gaussian mixture (GM) layer -- that can be integrated into neural network architectures. As a proof of concept, we validate our proposal through experiments on simple classification tasks, where a GM layer achieves test performance comparable to that of a two-layer fully connected network. Furthermore, we examine the behavior of these dynamics and demonstrate numerically that GM layers exhibit markedly different behavior compared to classical fully connected layers, even when the latter are large enough to be considered in the mean-field regime.
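To make the idea concrete, here is a minimal sketch of the mean-field forward pass that a GM layer could implement: the layer's parameters are a Gaussian mixture over neuron parameters \((a, w)\), and the output is a Monte Carlo estimate of \(f(x) = \mathbb{E}_{(a,w)\sim\mu}[a\,\sigma(w\cdot x)]\). All names, dimensions, and the diagonal-covariance simplification are assumptions made for illustration, not the paper's actual implementation, and the mixture parameters here are fixed rather than trained via a Wasserstein gradient flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: a mean-field layer f(x) = E_{(a,w)~mu}[a * relu(w.x)],
# where mu is a Gaussian mixture over parameters theta = (a, w) in R^{d+1}.
d, K, n_samples = 3, 4, 2000          # input dim, mixture components, MC samples
weights = rng.dirichlet(np.ones(K))   # mixture weights (learnable in the paper's setup)
means = rng.normal(size=(K, d + 1))   # component means over (a, w)
scales = 0.1 * np.ones((K, d + 1))    # diagonal std devs, for simplicity

def gm_layer(x):
    """Monte Carlo estimate of E_{(a,w)~mu}[a * relu(w.x)] for a batch x."""
    comps = rng.choice(K, size=n_samples, p=weights)       # sample component indices
    theta = means[comps] + scales[comps] * rng.normal(size=(n_samples, d + 1))
    a, w = theta[:, 0], theta[:, 1:]                       # split theta into (a, w)
    return np.maximum(w @ x.T, 0.0).T @ a / n_samples      # average over particles

x = rng.normal(size=(5, d))  # batch of 5 inputs
out = gm_layer(x)            # one scalar output per input
```

Training would then update `weights`, `means`, and `scales` by following the Wasserstein gradient flow of the loss over the mixture, rather than treating sampled particles as independent weights.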