🤖 AI Summary
Current LLM evaluations rely on benchmark-average scores that fail to uncover inter-task relationships or the intrinsic nature of model capabilities, leading to redundant tasks and an opaque "capability black box." This paper introduces factor analysis, the first application of this statistical method to multi-task LLM evaluation, modeling cross-task correlations and extracting interpretable latent variables from the performance of 60 models across 44 diverse tasks. Results show that only 3–5 core latent skills account for the vast majority of performance variance, explicitly revealing task redundancy and the underlying capability structure. Building on this capability decomposition, the authors construct the first systematic, ability-aware leaderboard, enabling fine-grained model diagnostics, task-set pruning, and granular capability profiling. The approach establishes a new paradigm for LLM evaluation: interpretable, decomposable, and reusable.
📝 Abstract
Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack an understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture a model's holistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
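To make the pipeline concrete, below is a minimal sketch of factor analysis applied to a models-by-tasks score matrix. It uses scikit-learn's `FactorAnalysis` on synthetic placeholder data; the library choice, the four-factor setting, and the varimax rotation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Placeholder score matrix: rows = models, columns = benchmark tasks.
# The paper's setting is 60 models x 44 tasks; random data stands in here.
rng = np.random.default_rng(0)
scores = rng.random((60, 44))

# Standardize each task so the factors reflect cross-task correlation
# structure rather than differences in score scale across benchmarks.
scores_z = StandardScaler().fit_transform(scores)

# Fit a factor model with a small number of latent skills (the paper reports
# 3-5 factors explaining most variance; 4 is an illustrative choice).
fa = FactorAnalysis(n_components=4, rotation="varimax", random_state=0)
model_skills = fa.fit_transform(scores_z)  # (60, 4): per-model skill profile
task_loadings = fa.components_.T           # (44, 4): each task's loading on each skill

# Tasks with near-identical loading patterns measure the same latent skill
# and are candidates for pruning as redundant; model_skills rows give a
# fine-grained, ability-aware profile for each model.
```

In this sketch, the factor loadings expose which benchmarks cluster on the same latent skill, and the per-model factor scores provide the capability profiles that an ability-aware leaderboard would rank on.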