From tests to effect sizes: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation benchmarks

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multilingual, multitask NLP evaluation benchmarks lack statistically reliable metrics because uncertainty from both model stochasticity and data sampling variability goes unquantified. Method: We propose a resampling framework that jointly models two sources of variation: model randomness (e.g., weight initialization, training dynamics) and data sampling variability. The bootstrap is used to construct empirical sampling distributions for standard metrics (e.g., accuracy, BLEU, F1), enabling principled confidence-interval estimation for key statistics including means, medians, pairwise model differences, and rankings. Contribution/Results: Evaluated on multilingual question answering, machine translation, and named entity recognition, the approach yields markedly more precise characterizations of performance fluctuations. It improves comparability and reproducibility across models and languages, offering an interpretable paradigm for uncertainty quantification in NLP benchmarking.
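The confidence-interval construction described in the summary can be illustrated with a minimal percentile-bootstrap sketch. This is not the authors' code: the function name `bootstrap_ci` and the toy QA data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap CI for the mean of per-example scores."""
    scores = np.asarray(scores, dtype=float)
    # Resample test examples with replacement n_boot times.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)  # empirical sampling distribution of the mean
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Toy data: 0/1 correctness of 200 hypothetical QA predictions.
correct = rng.random(200) < 0.7
acc, (lo, hi) = bootstrap_ci(correct)
```

The same recipe applies to any per-example metric (e.g., sentence-level F1); corpus-level metrics such as BLEU require resampling the corpus and recomputing the metric on each replicate.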

📝 Abstract
In this paper, we introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for computing sampling distributions for various quantities used in leaderboards such as the average/median, pairwise differences between models, and rankings.
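The abstract's warning that ignoring either the model- or data-related source substantially underestimates total variability can be sketched as a two-level bootstrap over training runs and test examples. This is an illustrative reconstruction under assumed names (`two_source_bootstrap`) and toy data, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def two_source_bootstrap(score_matrix, n_boot=2_000):
    """score_matrix[i, j]: metric for training run (seed) i on test example j.
    Resampling rows captures model randomness; resampling columns captures
    data sampling variability. Dropping either shrinks the estimated spread."""
    S = np.asarray(score_matrix, dtype=float)
    n_seeds, n_items = S.shape
    means = np.empty(n_boot)
    for b in range(n_boot):
        seeds = rng.integers(0, n_seeds, n_seeds)   # resample training runs
        items = rng.integers(0, n_items, n_items)   # resample test examples
        means[b] = S[np.ix_(seeds, items)].mean()
    return means

# Toy data: 5 seeds x 100 items, with both seed- and item-level noise.
S = 0.7 + rng.normal(0, 0.02, (5, 1)) + rng.normal(0, 0.05, (5, 100))
dist = two_source_bootstrap(S, n_boot=500)
```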
Problem

Research questions and friction points this paper is trying to address.

Quantifying uncertainty in multilingual and multitask NLP benchmarks
Measuring statistical variability from model and data sources
Computing sampling distributions for leaderboard metrics and rankings
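The last friction point, sampling distributions for rankings, can be sketched by bootstrapping the shared test set and recording the leaderboard order of each replicate. The helper name `rank_distribution` and the two-model toy data are assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def rank_distribution(per_item_scores, n_boot=2_000):
    """per_item_scores: {model_name: per-example scores on the same test set}.
    Returns estimated P(model attains rank r) under test-set resampling."""
    names = list(per_item_scores)
    M = np.stack([np.asarray(per_item_scores[n], float) for n in names])
    n_models, n_items = M.shape
    counts = np.zeros((n_models, n_models))
    for _ in range(n_boot):
        items = rng.integers(0, n_items, n_items)      # paired resample of items
        order = np.argsort(-M[:, items].mean(axis=1))  # best model first
        for rank, model in enumerate(order):
            counts[model, rank] += 1
    return names, counts / n_boot

# Toy leaderboard: model A is clearly stronger than B on 100 shared items.
scores = {"A": rng.random(100) < 0.85, "B": rng.random(100) < 0.55}
names, probs = rank_distribution(scores)
```

Resampling the same items for every model (a paired bootstrap) keeps item-difficulty variation from being double-counted when models are compared.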
Innovation

Methods, ideas, or system contributions that make the work stand out.

Resampling methods quantify metric uncertainty
Model and data sources explain performance variation
Sampling distributions computed for leaderboard metrics
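For pairwise differences between models, one common choice consistent with the approach above is a paired bootstrap over shared test items; the sketch below, with the assumed name `paired_diff_ci` and toy data, shows the idea.

```python
import numpy as np

rng = np.random.default_rng(3)

def paired_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """Paired-bootstrap CI for the mean per-example score difference A - B."""
    d = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    idx = rng.integers(0, len(d), size=(n_boot, len(d)))
    diffs = d[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return d.mean(), (lo, hi)

# Toy data: correctness of two models on 150 shared test items.
a = rng.random(150) < 0.80
b = rng.random(150) < 0.65
mean_d, (lo, hi) = paired_diff_ci(a, b)
```

If the interval excludes zero, the observed gap is unlikely to be an artifact of test-set sampling alone.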