🤖 AI Summary
This study investigates the intrinsic difficulty assessment of NLP instances arising from inherent linguistic ambiguity, focusing on the interplay among three uncertainty measures: annotator disagreement, training dynamics, and model confidence. Using comparative experiments across 29 models and three standard NLP datasets, augmented with nonlinear fitting analysis, we uncover—for the first time—a non-monotonic, nonlinear relationship among these measures, thereby disentangling the multidimensional nature of data complexity. Results show that conventional confidence metrics exhibit weak correlation with true instance difficulty; annotator disagreement serves as the strongest proxy for intrinsic difficulty; and high-confidence erroneous predictions are pervasive. These findings challenge the implicit assumption that model confidence equates to reliability, exposing a fundamental flaw in current evaluation paradigms. The work provides both theoretical grounding and empirical evidence for developing more robust NLP evaluation frameworks and model improvement strategies.
📝 Abstract
The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.