Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks

📅 2025-01-08
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Aggregated metrics in machine learning benchmarks, such as multi-task average accuracy, are typically reported without quantifying uncertainty across tasks, which undermines their statistical reliability. Method: This paper introduces a statistical uncertainty quantification framework with three components: (i) visualization of task weightings that accounts for standard errors, (ii) Bayesian hierarchical modeling to capture inter-task heterogeneity via shared and task-specific priors, and (iii) task-level bootstrap resampling to explicitly model performance variability and uncertainty in task weightings. Contribution/Results: Evaluated on VTAB, the framework reveals models that rank low overall yet significantly outperform others on specific tasks. It equips aggregated metrics with interpretable confidence intervals, enables principled cross-model comparison, and supports diagnostic bias analysis, thereby mitigating over-simplified interpretations of model performance and improving benchmark interpretability and robustness.
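
As a minimal illustration of the task-level bootstrap described above, the Python sketch below resamples tasks with replacement and recomputes the aggregate on each replicate to form a percentile confidence interval. The task accuracies are hypothetical placeholders, not numbers from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task accuracies for one model on a VTAB-style benchmark.
task_accuracies = np.array([0.81, 0.64, 0.92, 0.55, 0.73, 0.88, 0.69])

n_boot = 10_000
n_tasks = task_accuracies.size

# Resample *tasks* with replacement and recompute the aggregate on each replicate.
idx = rng.integers(0, n_tasks, size=(n_boot, n_tasks))
boot_means = task_accuracies[idx].mean(axis=1)

point_estimate = task_accuracies.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"average accuracy: {point_estimate:.3f} "
      f"(95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}])")
```

Resampling at the task level (rather than over test examples within a task) treats the benchmark's task suite itself as the source of uncertainty, so the resulting interval reflects cross-task variability in the aggregate metric.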

📝 Abstract
Modern artificial intelligence is supported by machine learning models (e.g., foundation models) that are pretrained on a massive data corpus and then adapted to solve a variety of downstream tasks. To summarize performance across multiple tasks, evaluation metrics are often aggregated into a summary metric, e.g., average accuracy across 10 question-answering tasks. When aggregating evaluation metrics, it is useful to incorporate uncertainty in the aggregate metric in order to gain a more realistic understanding of model performance. Our objective in this work is to demonstrate how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. The methods we emphasize are bootstrapping, Bayesian hierarchical (i.e., multilevel) modeling, and the visualization of task weightings that consider standard errors. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance. We use a popular ML benchmark, the Visual Task Adaptation Benchmark (VTAB), to demonstrate the usefulness of our approaches.
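
As a rough sketch of the Bayesian hierarchical (multilevel) modeling the abstract mentions, the PyMC code below partially pools task-level accuracies through a shared benchmark-level mean, whose posterior carries an interpretable credible interval. The priors, data values, and standard errors are illustrative assumptions, not the paper's exact model.

```python
import numpy as np
import pymc as pm

# Hypothetical per-task accuracies and their standard errors (e.g., from test-set size).
acc = np.array([0.81, 0.64, 0.92, 0.55, 0.73])
se = np.array([0.02, 0.03, 0.01, 0.04, 0.02])

with pm.Model() as hierarchical_model:
    # Benchmark-level mean and inter-task heterogeneity (shared priors).
    mu = pm.Normal("mu", mu=0.7, sigma=0.2)
    tau = pm.HalfNormal("tau", sigma=0.2)

    # Task-specific latent accuracies, partially pooled toward mu.
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(acc))

    # Observed accuracies with known task-level standard errors.
    pm.Normal("obs", mu=theta, sigma=se, observed=acc)

    trace = pm.sample(2000, tune=1000, chains=4, random_seed=0)

# The posterior of `mu` gives the aggregate metric with a credible interval.
mu_samples = trace.posterior["mu"].values.ravel()
print(f"aggregate accuracy: {mu_samples.mean():.3f} "
      f"(95% credible interval: [{np.percentile(mu_samples, 2.5):.3f}, "
      f"{np.percentile(mu_samples, 97.5):.3f}])")
```

Partial pooling shrinks noisy task estimates toward the shared mean, with the inter-task scale tau learned from the data; this is one standard way to realize the "shared and task-specific priors" described in the summary.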
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Performance Metrics
Statistical Uncertainty
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian hierarchical model
bootstrap method
multi-task learning
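
The remaining component named in the summary and abstract, visualizing task weightings alongside standard errors, could look roughly like the matplotlib sketch below: per-task accuracies with 95% error bars, marker areas proportional to hypothetical aggregation weights, and the weighted aggregate as a reference line. The plot design, task names, and all values are assumptions for illustration, not the paper's figures.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-task accuracies, standard errors, and aggregation weights.
tasks = ["caltech", "cifar", "dtd", "flowers", "pets"]
acc = np.array([0.81, 0.64, 0.92, 0.55, 0.73])
se = np.array([0.02, 0.03, 0.01, 0.04, 0.02])
weights = np.array([0.30, 0.10, 0.25, 0.15, 0.20])  # must sum to 1

fig, ax = plt.subplots(figsize=(6, 3))
# 95% error bars from the per-task standard errors.
ax.errorbar(range(len(tasks)), acc, yerr=1.96 * se,
            fmt="none", ecolor="gray", capsize=3)
# Marker area proportional to each task's weight in the aggregate.
ax.scatter(range(len(tasks)), acc, s=1000 * weights)
ax.axhline((weights * acc).sum(), linestyle="--", label="weighted aggregate")
ax.set_xticks(range(len(tasks)))
ax.set_xticklabels(tasks)
ax.set_ylabel("accuracy")
ax.legend()
plt.tight_layout()
plt.show()
```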
👥 Authors
Rachel Longjohn
University of California, Irvine
Giri Gopalan
Los Alamos National Laboratory
Emily Casleton
Scientist, Los Alamos National Laboratory
statistics · Bayesian non-parametrics · AI testing and evaluation · AI research