🤖 AI Summary
While classical underparameterized ensembles improve generalization, modern overparameterized neural network ensembles often exhibit no such benefit, and the underlying mechanism has remained unclear.
Method: Using ensembles of random feature (RF) regressors as a tractable model, we rigorously prove that infinite ensembles of overparameterized RF regressors are pointwise equivalent to a single infinite-width RF regressor. Analyzing both ridgeless regression, where the equivalence is exact, and small-ridge regression, where it is approximate, we decompose the generalization error and the prediction variance.
Contribution: We provide the first theoretical demonstration that overparameterized ensembles achieve generalization performance nearly identical to that of a single large model. Crucially, we show that the prediction variance across ensemble members primarily reflects the expected effect of increased model capacity, not epistemic uncertainty, challenging the long-standing heuristic that ensembles must outperform their individual members. Our analysis reveals that in the overparameterized regime, ensemble averaging fails to reduce variance meaningfully because the members' predictions are highly correlated. This fundamentally revises conventional wisdom about the benefits of ensembles in deep learning.
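The core equivalence lends itself to a quick numerical check. Below is a minimal sketch, not taken from the paper: it uses ridgeless (minimum-norm) regression on random ReLU features over synthetic data, and the helper `rf_predict`, the widths, and the ensemble size are all illustrative assumptions. Averaging many overparameterized members should approximately match a single much wider model, with agreement tightening as the ensemble size and the single model's width grow toward the limits the theory describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem (synthetic stand-in; any dataset would do).
n, d = 30, 5                      # n training points in d dimensions
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
X_test = rng.standard_normal((200, d))

def rf_predict(X_tr, y_tr, X_te, width, rng):
    """Ridgeless (minimum-norm) regression on `width` random ReLU features."""
    d = X_tr.shape[1]
    W = rng.standard_normal((d, width)) / np.sqrt(d)     # random first layer
    Phi_tr = np.maximum(X_tr @ W, 0.0) / np.sqrt(width)  # train features
    Phi_te = np.maximum(X_te @ W, 0.0) / np.sqrt(width)  # test features
    beta = np.linalg.pinv(Phi_tr) @ y_tr                 # min-norm interpolant
    return Phi_te @ beta

# Ensemble of K overparameterized members (each width p > n), averaged pointwise.
K, p = 200, 64
ensemble = np.mean([rf_predict(X, y, X_test, p, rng) for _ in range(K)], axis=0)

# A single, much wider model as a finite proxy for the infinite-width limit.
single = rf_predict(X, y, X_test, K * p, rng)

# The two prediction vectors should agree increasingly well as K and width grow.
print("max |ensemble - single|:", np.abs(ensemble - single).max())
```

Dividing the features by the square root of the width keeps the implied kernel stable across widths, so the ensemble members and the wide single model are comparable objects rather than differently scaled predictors.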
📝 Abstract
Classic tree-based ensembles generalize better than any single decision tree. In contrast, recent empirical studies find that modern ensembles of (overparameterized) neural networks may not provide any inherent generalization advantage over single, larger neural networks. This paper clarifies how modern overparameterized ensembles differ from their classic underparameterized counterparts, using ensembles of random feature (RF) regressors as a basis for developing theory. In contrast to the underparameterized regime, where ensembling typically induces regularization and improves generalization, we prove that infinite ensembles of overparameterized RF regressors become pointwise equivalent to (single) infinite-width RF regressors. This equivalence, which is exact for ridgeless models and approximate for small ridge penalties, implies that overparameterized ensembles and single large models exhibit nearly identical generalization. As a consequence, we can characterize the predictive variance among ensemble members and demonstrate that it quantifies the expected effects of increasing capacity rather than capturing any conventional notion of uncertainty. Our results challenge common assumptions about the advantages of ensembles in overparameterized settings, prompting a reconsideration of how well intuitions from underparameterized ensembles transfer to deep ensembles and the overparameterized regime.
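To illustrate the variance claim, the hedged sketch below (same synthetic setup and illustrative `rf_predict` as in the sketch above; this is our reading of the result, not an experiment from the paper) tracks the variance of member predictions at fixed test points as width grows. A variance that captured uncertainty about the data would not vanish with capacity; here it should decay toward zero, consistent with interpreting it as the expected effect of finite width, i.e., the gap to the infinite-width model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same synthetic setup and ridgeless RF predictor as in the earlier sketch.
n, d = 30, 5
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
X_test = rng.standard_normal((200, d))

def rf_predict(X_tr, y_tr, X_te, width, rng):
    d = X_tr.shape[1]
    W = rng.standard_normal((d, width)) / np.sqrt(d)
    Phi_tr = np.maximum(X_tr @ W, 0.0) / np.sqrt(width)
    Phi_te = np.maximum(X_te @ W, 0.0) / np.sqrt(width)
    return Phi_te @ (np.linalg.pinv(Phi_tr) @ y_tr)

# Variance of member predictions at fixed test points, as member width grows.
# If this variance measured data uncertainty, it would not shrink with capacity.
for p in [64, 256, 1024, 4096]:
    preds = np.stack([rf_predict(X, y, X_test, p, rng) for _ in range(50)])
    print(f"width {p:4d}: mean variance across members = {preds.var(axis=0).mean():.5f}")
```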