🤖 AI Summary
Existing multilingual and multitask NLP benchmarks report evaluation metrics without quantifying the uncertainty that arises from model stochasticity and data sampling variability, which makes comparisons between models statistically unreliable.
Method: We propose a resampling framework that jointly models two sources of variation: model randomness (e.g., weight initialization, training dynamics) and data sampling variability. The bootstrap is used to construct empirical sampling distributions for standard metrics (e.g., accuracy, BLEU, F1), which enables principled confidence interval estimation for key statistics, including means, medians, pairwise model differences, and rankings.
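To make the dual-resampling idea concrete, here is a minimal sketch, assuming per-example scores from several independently trained runs of one model are available as a 2-D array; the function `bootstrap_metric`, the array shapes, and the synthetic data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_metric(scores, n_boot=10_000):
    """Empirical sampling distribution of a mean metric under two sources of variation.

    scores: (n_runs, n_examples) array of per-example scores (e.g., 0/1 exact match),
    one row per independently trained run. Each bootstrap replicate resamples both
    the runs (model randomness) and the test examples (data sampling variability).
    """
    n_runs, n_examples = scores.shape
    stats = np.empty(n_boot)
    for b in range(n_boot):
        run_idx = rng.integers(0, n_runs, size=n_runs)          # resample model runs
        ex_idx = rng.integers(0, n_examples, size=n_examples)   # resample test examples
        stats[b] = scores[np.ix_(run_idx, ex_idx)].mean()
    return stats

# Hypothetical data: 5 runs x 2,000 test examples of 0/1 scores.
scores = rng.binomial(1, 0.72, size=(5, 2000)).astype(float)
dist = bootstrap_metric(scores)
lo, hi = np.percentile(dist, [2.5, 97.5])
print(f"accuracy = {scores.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling only the examples (or only the runs) would recover just one of the two variance components; drawing both index sets in each replicate is what captures their combined effect.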
Contribution/Results: Evaluated on multilingual question answering, machine translation, and named entity recognition, the approach yields a substantially more precise characterization of performance variability. It enhances comparability and reproducibility across models and languages, offering an interpretable paradigm for uncertainty quantification in NLP benchmarking.
📝 Abstract
In this paper, we introduce a set of resampling-based methods for quantifying the uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for computing sampling distributions of quantities used in leaderboards, such as the average/median, pairwise differences between models, and rankings.
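The leaderboard quantities mentioned above (pairwise differences and rankings) can be read directly off the same bootstrap replicates. The sketch below is a hypothetical illustration under the same assumptions as before, with made-up model names and synthetic scores; it is not the paper's released code.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_leaderboard(model_scores, n_boot=5_000):
    """Bootstrap sampling distributions of per-model mean scores.

    model_scores: dict mapping model name -> (n_runs, n_examples) array of
    per-example scores on a shared test set. Test examples are resampled
    jointly for all models, so pairwise differences and rankings within a
    replicate are computed on the same resampled data.
    """
    names = list(model_scores)
    n_examples = next(iter(model_scores.values())).shape[1]
    means = {m: np.empty(n_boot) for m in names}
    for b in range(n_boot):
        ex_idx = rng.integers(0, n_examples, size=n_examples)  # shared data resample
        for m in names:
            runs = model_scores[m]
            run_idx = rng.integers(0, runs.shape[0], size=runs.shape[0])  # per-model run resample
            means[m][b] = runs[np.ix_(run_idx, ex_idx)].mean()
    return means

# Hypothetical comparison of two models on the same 2,000-example test set.
scores = {
    "model_A": rng.binomial(1, 0.74, size=(5, 2000)).astype(float),
    "model_B": rng.binomial(1, 0.72, size=(5, 2000)).astype(float),
}
means = bootstrap_leaderboard(scores)
diff = means["model_A"] - means["model_B"]
print("95% CI on mean difference A - B:", np.percentile(diff, [2.5, 97.5]).round(3))
print("P(model_A ranked above model_B):", (diff > 0).mean())
```

The fraction of replicates in which a model comes out on top gives an estimate of how stable its leaderboard rank is, which is the kind of ranking-level uncertainty the abstract refers to.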