The Universality Lens: Why Even Highly Over-Parametrized Models Learn Well

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses a fundamental question in deep learning: *why do over-parameterized models generalize well?* Methodologically, it introduces the "hypothesis weight", defined as the cumulative prior probability of all models within a KL-divergence neighborhood of the true data-generating process, as a measure of model simplicity, and proves that generalization depends on the weight of simple hypotheses rather than on the overall capacity of the hypothesis space. The approach combines Bayesian mixture learning under logarithmic loss with a near-uniform prior, stochastic gradient Langevin dynamics, and ensemble analysis, establishing a non-uniform regret bound. The resulting theoretical framework unifies explanations for phenomena such as flat minima and knowledge distillation, and applies broadly across online, batch, and supervised learning paradigms. In doing so, it offers a principled, information-theoretic and learning-theoretic foundation for generalization in large-scale models.
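As a rough sketch of the kind of bound the summary describes (the notation below is assumed for illustration, not quoted from the paper): writing pi for the near-uniform prior over the hypothesis class and p* for the true source, the "weight" is the prior mass of the KL-neighborhood of p*, and the Bayesian mixture's expected log-loss regret is controlled by that weight rather than by the size of the class.

```latex
% Illustrative notation (assumed, not taken from the paper):
% \pi    near-uniform prior over the hypothesis class \Theta
% p^*    true data-generating process; x^n = (x_1, \dots, x_n) i.i.d. samples
% q_\pi  Bayesian mixture predictor
\[
  w(\epsilon) \;=\; \sum_{\theta \in \Theta \,:\, D_{\mathrm{KL}}(p^* \,\|\, p_\theta) \le \epsilon} \pi(\theta)
  \qquad \text{(the ``weight'' of the hypotheses close to } p^* \text{)}
\]
\[
  \frac{1}{n}\,\mathbb{E}\!\left[\log \frac{p^*(x^n)}{q_\pi(x^n)}\right]
  \;\le\; \inf_{\epsilon > 0} \left\{ \epsilon \;+\; \frac{1}{n}\log\frac{1}{w(\epsilon)} \right\},
  \qquad
  q_\pi(x^n) \;=\; \sum_{\theta \in \Theta} \pi(\theta)\, p_\theta(x^n).
\]
```

In this form, a hypothesis is "simple" when w(epsilon) is large, so the log(1/w(epsilon))/n term vanishes quickly with the sample size; the total number of hypotheses never enters the bound directly.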

📝 Abstract
A fundamental question in modern machine learning is why large, over-parameterized models, such as deep neural networks and transformers, tend to generalize well, even when their number of parameters far exceeds the number of training samples. We investigate this phenomenon through the lens of information theory, grounded in universal learning theory. Specifically, we study a Bayesian mixture learner with log-loss and (almost) uniform prior over an expansive hypothesis class. Our key result shows that the learner's regret is not determined by the overall size of the hypothesis class, but rather by the cumulative probability of all models that are close, in Kullback-Leibler divergence, to the true data-generating process. We refer to this cumulative probability as the weight of the hypothesis. This leads to a natural notion of model simplicity: simple models are those with large weight and thus require fewer samples to generalize, while complex models have small weight and need more data. This perspective provides a rigorous and intuitive explanation for why over-parameterized models often avoid overfitting: the presence of simple hypotheses allows the posterior to concentrate on them when supported by the data. We further bridge theory and practice by recalling that stochastic gradient descent with Langevin dynamics samples from the correct posterior distribution, enabling our theoretical learner to be approximated using standard machine learning methods combined with ensemble learning. Our analysis yields non-uniform regret bounds and aligns with key practical concepts such as flat minima and model distillation. The results apply broadly across online, batch, and supervised learning settings, offering a unified and principled understanding of the generalization behavior of modern AI systems.
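As a toy illustration of the Bayesian mixture learner described in the abstract (the Bernoulli family, uniform prior, epsilon, and sample size below are illustrative choices, not the paper's experiments), the sketch runs the mixture predictor under log-loss and compares its realized regret against the log(1/weight) + n*epsilon quantity built from the prior mass of the KL-neighborhood of the true source:

```python
import numpy as np

# Toy illustration (not the paper's setup): Bayesian mixture over Bernoulli
# sources with a uniform prior, predicting under log-loss.
rng = np.random.default_rng(0)
thetas = np.linspace(0.05, 0.95, 19)           # hypothesis class
prior = np.full(len(thetas), 1.0 / len(thetas))
p_star = 0.30                                  # true data-generating parameter
n = 2000
x = rng.binomial(1, p_star, size=n)

log_post = np.log(prior)                       # unnormalized log-posterior
mixture_loss = 0.0                             # cumulative log-loss of the mixture
oracle_loss = 0.0                              # cumulative log-loss of the true source

for xt in x:
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    p1 = np.dot(post, thetas)                  # mixture's predictive prob of x_t = 1
    mixture_loss += -np.log(p1 if xt == 1 else 1.0 - p1)
    oracle_loss += -np.log(p_star if xt == 1 else 1.0 - p_star)
    # Bayesian update: multiply each hypothesis by its likelihood of x_t.
    log_post += np.log(thetas) if xt == 1 else np.log(1.0 - thetas)

regret = mixture_loss - oracle_loss

# "Weight" of the hypotheses KL-close to p_star (epsilon chosen for illustration).
def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

eps = 0.01
weight = prior[kl_bernoulli(p_star, thetas) <= eps].sum()
print(f"regret = {regret:.2f} nats, "
      f"log(1/weight) + n*eps = {np.log(1 / weight) + n * eps:.2f}")
```

The point of the toy run is only that the regret is governed by the prior mass near the true source, not by the nineteen hypotheses in the class; shrinking the grid spacing (more hypotheses) leaves the weight of the KL-neighborhood, and hence the regret behavior, essentially unchanged.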
Problem

Research questions and friction points this paper is trying to address.

Why over-parameterized models generalize well despite excess parameters
Analyzing generalization via information theory and Bayesian mixture learners
Linking theoretical simplicity to practical concepts like flat minima
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian mixture learner with uniform prior
Regret determined by hypothesis weight
SGD with Langevin dynamics approximates posterior sampling (see the sketch after this list)
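A minimal sketch of that bridge, assuming a toy logistic-regression model and illustrative step sizes and burn-in (none of these choices come from the paper): stochastic gradient Langevin dynamics (SGLD) adds Gaussian noise to noisy log-posterior gradient steps so the iterates approximately sample the posterior, and averaging the sampled predictors approximates the Bayesian mixture's predictive distribution.

```python
import numpy as np

# Hypothetical sketch: SGLD draws approximate posterior samples, and the
# ensemble average of their predictions mimics the Bayesian mixture.
rng = np.random.default_rng(1)

# Toy logistic-regression data (illustrative, not from the paper).
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_true)))

def grad_log_post(w, Xb, yb, scale):
    # Gradient of log-likelihood (rescaled from the minibatch) plus log-prior.
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    grad_ll = scale * (Xb.T @ (yb - p))
    grad_prior = -w                      # standard normal prior on the weights
    return grad_ll + grad_prior

w = np.zeros(5)
step, batch, samples = 1e-3, 50, []
for t in range(5000):
    idx = rng.choice(len(X), size=batch, replace=False)
    g = grad_log_post(w, X[idx], y[idx], scale=len(X) / batch)
    # SGLD update: gradient ascent on the log-posterior plus Gaussian noise.
    w = w + 0.5 * step * g + np.sqrt(step) * rng.normal(size=5)
    if t > 2000 and t % 10 == 0:         # keep post-burn-in samples
        samples.append(w.copy())

# Ensemble prediction: average per-sample predictive probabilities,
# approximating the Bayesian mixture's predictive distribution.
x_new = rng.normal(size=5)
probs = [1.0 / (1.0 + np.exp(-x_new @ s)) for s in samples]
print(f"posterior-averaged P(y=1 | x_new) = {np.mean(probs):.3f}")
```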
Meir Feder
Professor of Electrical Engineering, Tel-Aviv University
Information Theory, Communication, Signal Processing, Mathematical Finance, Learning
Ruediger Urbanke
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Yaniv Fogel
School of Electrical and Computer Engineering, Tel-Aviv University, Tel-Aviv, Israel