🤖 AI Summary
This work addresses how to aggregate predictive distributions in deep ensembles to improve both performance and reliability. Through the lens of log-likelihood, the authors analyze the normalized generalized mean of order \( r \) and obtain a unified framework that characterizes how aggregation behaves across orders. They show that \( r \in [0,1] \) is the only regime that guarantees systematic improvement over the individual models, which gives a principled justification for the empirical success of linear pooling (\( r=1 \)) and geometric pooling (\( r=0 \), obtained as the limit \( r \to 0 \)). For \( r \notin [0,1] \), explicit counterexamples show that consistent gains can fail. Experiments with Deep Ensembles on image and text classification benchmarks corroborate the theory.
📝 Abstract
Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation rule remains an open question; two commonly proposed approaches are linear pooling (probability averaging) and geometric pooling (logit averaging). In this work, we address this question by studying the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty,+\infty\}$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. This provides a unifying aggregation formalism and shows that different configurations are optimal in different situations. We show that the regime $r \in [0,1]$ is the only range ensuring systematic improvements relative to individual distributions, thereby providing a principled justification for the reliability and widespread practical use of linear ($r=1$) and geometric ($r=0$) pooling. In contrast, we show with explicit counterexamples that aggregation rules with $r \notin [0,1]$ may fail to provide consistent gains. Finally, we corroborate our theoretical findings with empirical evaluations using Deep Ensembles on image and text classification benchmarks.
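
As a rough illustration of the aggregation rule discussed above, the following NumPy sketch pools the class probabilities of $K$ ensemble members with a normalized generalized mean, $q_r(y) \propto \left(\frac{1}{K}\sum_{k=1}^{K} p_k(y)^r\right)^{1/r}$. It is not the authors' implementation: the uniform weights $1/K$, the function name `generalized_mean_pool`, the clipping constant `eps`, and the toy probabilities are all assumptions made for this example. The cases $r=0$, $r=+\infty$ and $r=-\infty$ are handled as the usual limits (geometric mean, maximum, minimum).

```python
import numpy as np

def generalized_mean_pool(probs, r, eps=1e-12):
    """Pool K categorical distributions with a normalized generalized mean.

    probs: array of shape (K, C); each row is one member's distribution.
    r:     order of the mean (any real number, or +/- np.inf).
    eps:   hypothetical clipping constant to avoid log(0).
    """
    log_p = np.log(np.clip(np.asarray(probs, dtype=float), eps, 1.0))
    if r == 0:
        # Limit r -> 0: geometric mean, i.e. averaging log-probabilities.
        log_m = log_p.mean(axis=0)
    elif np.isinf(r):
        # Limits r -> +inf / -inf: per-class maximum / minimum.
        log_m = log_p.max(axis=0) if r > 0 else log_p.min(axis=0)
    else:
        # log of (mean_k p_k^r)^(1/r), computed in log space for stability.
        a = r * log_p
        a_max = a.max(axis=0)
        log_m = (a_max + np.log(np.mean(np.exp(a - a_max), axis=0))) / r
    # Normalize so that the pooled scores sum to one.
    log_m = log_m - log_m.max()
    m = np.exp(log_m)
    return m / m.sum()

# Toy example (made-up numbers): two members, three classes, true class 0.
members = np.array([[0.7, 0.2, 0.1],
                    [0.5, 0.4, 0.1]])
true_label = 0
for r in (-np.inf, -1, 0, 0.5, 1, 2, np.inf):
    pooled = generalized_mean_pool(members, r)
    print(f"r={r}: log-likelihood of true class = {np.log(pooled[true_label]):.4f}")
```

With $r=1$ the function reduces to plain probability averaging, and with $r=0$ to the renormalized geometric mean, so the two standard pooling rules appear as special cases of one routine.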