IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluations rely on benchmark-average scores, failing to uncover inter-task relationships or the intrinsic nature of model capabilities—leading to redundant tasks and an opaque “capability black box.” This paper introduces factor analysis—the first application of this statistical method to multi-task LLM evaluation—modeling cross-task correlations and extracting interpretable latent variables from performance data of 60 models across 44 diverse tasks. Results show that only 3–5 core latent skills account for the vast majority of performance variance, explicitly revealing task redundancy and the underlying capability structure. Based on this capability decomposition, we construct the first systematic, ability-aware leaderboard, enabling fine-grained model diagnostics, task set pruning, and granular capability profiling. Our approach establishes a new paradigm for LLM evaluation: interpretable, decomposable, and reusable.
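The core idea above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it fits a factor-analysis model to a synthetic stand-in for the paper's 60-model × 44-task score matrix (the generating process, noise level, and the 0.95 redundancy cutoff are all assumptions for illustration), then flags task pairs whose loading vectors are nearly parallel as redundancy candidates.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's data: scores of 60 models on 44 tasks,
# generated (by assumption) from a small number of latent skills plus noise.
n_models, n_tasks, n_skills = 60, 44, 4
skills = rng.normal(size=(n_models, n_skills))       # latent skill levels per model
weights = rng.normal(size=(n_skills, n_tasks))       # how much each task taps each skill
scores = skills @ weights + 0.3 * rng.normal(size=(n_models, n_tasks))

# Fit factor analysis with a handful of factors, mirroring the paper's
# finding that 3-5 latent skills explain most performance variance.
fa = FactorAnalysis(n_components=4, random_state=0)
fa.fit(scores)

# Two tasks whose loading vectors point in nearly the same direction measure
# the same mix of latent skills, making them candidates for pruning.
L = fa.components_                                   # shape: (n_factors, n_tasks)
unit = L / np.linalg.norm(L, axis=0, keepdims=True)  # normalize per-task loadings
task_sim = unit.T @ unit                             # cosine similarity, (44, 44)
redundant_pairs = np.argwhere(np.triu(task_sim, k=1) > 0.95)
print(f"{len(redundant_pairs)} highly overlapping task pairs")
```

The same loading matrix also supports the paper's other uses: a model's factor scores (`fa.transform(scores)`) give a per-skill capability profile rather than a single averaged number.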

📝 Abstract
Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models' holistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
Problem

Research questions and friction points this paper is trying to address.

Unclear interpretation of benchmark scores for LLM skills
Lack of understanding of task relationships and redundancies
Need for holistic evaluation beyond averaged benchmark scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

Factor analysis identifies latent LLM skills
Comprehensive leaderboard evaluates 60 LLMs
Tools for task redundancy and model selection
Aviya Maimon
Bar-Ilan University
Amir DN Cohen
OriginAI
Gal Vishne
Data Science Institute, Columbia University
Shauli Ravfogel
Faculty Fellow, NYU
NLP, Machine Learning
Reut Tsarfaty
Bar-Ilan University
Natural Language Processing, Computational Linguistics, Artificial Intelligence