🤖 AI Summary
Conventional machine learning often neglects physical principles, leading to suboptimal generalization and limited theoretical grounding for optimization.
Method: We model neural networks as one-dimensional non-interacting particle systems and introduce statistical mechanical entropy to characterize model states. Using the Wang–Landau algorithm, we construct entropy–generalization landscapes for million-parameter networks.
Contribution/Results: We discover a pronounced “entropy advantage”: high-entropy solutions consistently outperform low-entropy minima found by SGD and other standard optimizers—by up to 2.3× in narrow networks—challenging the universality assumption of SGD. Across arithmetic reasoning, tabular data, image classification, and language modeling, high-entropy states yield average test accuracy gains of 3.2–7.8%. This work establishes a new physics-informed paradigm for optimizer design, grounded in statistical mechanics and providing both theoretical justification and empirical validation for entropy-driven optimization.
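To make the method concrete, the sketch below illustrates a flat-histogram Wang–Landau run on a toy fully connected network: it estimates the log density of states ln g(E) over binned training loss, which is the entropy landscape S(E) up to an additive constant. This is a minimal sketch only; the toy task, network size, loss-bin edges, proposal scale, and flatness threshold are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task and a tiny fully connected network; all weights are
# flattened into one vector, echoing the "one particle per parameter" picture.
X = rng.normal(size=(64, 4))
y = np.sin(X.sum(axis=1))
n_params = 4 * 8 + 8  # hypothetical 4->8 hidden layer plus an 8->1 readout

def loss(theta):
    W1 = theta[:32].reshape(4, 8)
    w2 = theta[32:]
    pred = np.tanh(X @ W1) @ w2
    return float(np.mean((pred - y) ** 2))

# "Energy" (training-loss) bins over which the entropy landscape is built.
edges = np.linspace(0.0, 2.0, 41)  # assumed loss range; larger losses clip into the last bin

def bin_of(E):
    return int(np.clip(np.digitize(E, edges) - 1, 0, len(edges) - 2))

log_g = np.zeros(len(edges) - 1)  # running estimate of ln g(E), i.e. S(E)
hist = np.zeros_like(log_g)       # visit histogram used for the flatness check
ln_f = 1.0                        # modification factor, reduced as sampling converges

theta = rng.normal(scale=0.5, size=n_params)
E = loss(theta)

while ln_f > 1e-3:
    for _ in range(20_000):
        proposal = theta.copy()
        proposal[rng.integers(n_params)] += rng.normal(scale=0.1)  # perturb one "particle"
        E_new = loss(proposal)
        # Accept with probability min(1, g(E)/g(E_new)): drives a flat histogram over loss.
        if np.log(rng.random()) < log_g[bin_of(E)] - log_g[bin_of(E_new)]:
            theta, E = proposal, E_new
        log_g[bin_of(E)] += ln_f
        hist[bin_of(E)] += 1
    visited = hist[hist > 0]
    if visited.size and visited.min() > 0.8 * visited.mean():  # histogram "flat enough"
        hist[:] = 0
        ln_f *= 0.5  # standard Wang-Landau refinement of the modification factor

entropy = log_g - log_g.max()  # S(E) = ln g(E), defined up to an additive constant
```

Each proposal perturbs a single parameter, and acceptance with probability min(1, g(E_old)/g(E_new)) pushes the walk toward a flat histogram over loss bins, so rarely visited low- and high-loss states are sampled as often as typical ones.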
📝 Abstract
While the 2024 Nobel Prize in Physics ignited a worldwide discussion on the origins of neural networks and their foundational links to physics, modern machine learning research predominantly focuses on computational and algorithmic advancements, overlooking the underlying physical picture. Here we introduce the concept of entropy into neural networks by reconceptualizing them as hypothetical physical systems in which each parameter is a non-interacting 'particle' within a one-dimensional space. By employing the Wang–Landau algorithm, we construct the entropy landscapes of neural networks (with up to 1 million parameters) as functions of training loss and test accuracy (or loss) across four distinct machine learning tasks: arithmetic questions, real-world tabular data, image recognition, and language modeling. Our results reveal the existence of an "entropy advantage", whereby high-entropy states generally outperform the states reached via classical training optimizers such as stochastic gradient descent. We also find that this advantage is more pronounced in narrower networks, indicating a need for training optimizers tailored to neural networks of different sizes.