🤖 AI Summary
Current LLM evaluations rely on benchmark-average scores that fail to uncover inter-task relationships or the intrinsic nature of model capabilities, leading to redundant tasks and an opaque "capability black box." This paper introduces factor analysis, the first application of this statistical method to multi-task LLM evaluation, modeling cross-task correlations and extracting interpretable latent variables from the performance of 60 models across 44 diverse tasks. Results show that only 3–5 core latent skills account for the vast majority of performance variance, explicitly revealing task redundancy and the underlying capability structure. Building on this capability decomposition, the authors construct the first systematic, ability-aware leaderboard, enabling fine-grained model diagnostics, task-set pruning, and granular capability profiling. The approach establishes a new paradigm for LLM evaluation: interpretable, decomposable, and reusable.
📝 Abstract
Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack an understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture a model's holistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
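To make the pipeline concrete, below is a minimal sketch of factor analysis applied to a models-by-tasks score matrix. It uses scikit-learn's `FactorAnalysis` on synthetic placeholder data; the library choice, the four-factor setting, and the varimax rotation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Placeholder score matrix: rows = models, columns = benchmark tasks.
# The paper's setting is 60 models x 44 tasks; random data stands in here.
rng = np.random.default_rng(0)
scores = rng.random((60, 44))

# Standardize each task so the factors reflect cross-task correlation
# structure rather than differences in score scale across benchmarks.
scores_z = StandardScaler().fit_transform(scores)

# Fit a factor model with a small number of latent skills (the paper reports
# 3-5 factors explaining most variance; 4 is an illustrative choice).
fa = FactorAnalysis(n_components=4, rotation="varimax", random_state=0)
model_skills = fa.fit_transform(scores_z)  # (60, 4): per-model skill profile
task_loadings = fa.components_.T           # (44, 4): each task's loading on each skill

# Tasks with near-identical loading patterns measure the same latent skill
# and are candidates for pruning as redundant; model_skills rows give a
# fine-grained, ability-aware profile for each model.
```

In this sketch, the factor loadings expose which benchmarks cluster on the same latent skill, and the per-model factor scores provide the capability profiles that an ability-aware leaderboard would rank on.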