🤖 AI Summary
Current LLM evaluation relies on benchmark scores whose task labels (e.g., “reasoning”, “commonsense”) often misalign with the actual cognitive capabilities required, leading to ambiguous capability attribution.
Method: We propose the first interpretable diagnostic framework that decomposes performance across 10 mainstream benchmarks into contributions from 10 fine-grained cognitive abilities. The approach integrates gradient-based importance scoring with targeted parameter ablation to define an Ability Impact Score (AIS) that quantifies each ability’s causal contribution to model performance.
Contribution/Results: Experiments reveal that most benchmarks rely on synergistic multi-ability engagement—not isolated skills—and that datasets sharing identical high-level labels exhibit markedly divergent ability compositions. Notably, code generation tasks benefit substantially from holistic capability enhancement. The framework establishes a new, auditable, and attributable paradigm for LLM capability analysis, enabling precise, mechanism-aware evaluation beyond aggregate benchmark scores.
📝 Abstract
Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability because they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify whether these benchmarks actually measure what their labels claim. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task can negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
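The AIS pipeline described above (gradient-based importance scoring, then targeted parameter ablation, then measuring the resulting score drop) can be sketched in miniature. This is a hedged toy illustration only: it substitutes a small linear model for an LLM, and the probe data, scoring function, and top-25% ablation fraction are all assumptions for demonstration, not details from the paper.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for an LLM: a single linear layer.
model = torch.nn.Linear(8, 1)
loss_fn = torch.nn.MSELoss()

# Hypothetical "ability probe" set: gradients on this data define which
# parameters matter for the ability under study.
x_ability = torch.randn(32, 8)
y_ability = x_ability.sum(dim=1, keepdim=True)

# Step 1: gradient-based importance score per parameter, |grad * weight|.
loss = loss_fn(model(x_ability), y_ability)
loss.backward()
importance = (model.weight.grad * model.weight).detach().abs().flatten()

# Hypothetical "benchmark" evaluation: higher score = lower loss.
x_bench = torch.randn(64, 8)
y_bench = x_bench.sum(dim=1, keepdim=True)

def benchmark_score(m):
    with torch.no_grad():
        return -loss_fn(m(x_bench), y_bench).item()

base_score = benchmark_score(model)

# Step 2: targeted ablation — zero out the top-25% most important
# parameters for this ability (the fraction is an illustrative choice).
k = importance.numel() // 4
top_idx = importance.topk(k).indices
with torch.no_grad():
    model.weight.data.view(-1)[top_idx] = 0.0

# Step 3: AIS = drop in benchmark score caused by removing the
# ability-linked parameters.
ais = base_score - benchmark_score(model)
print(f"AIS = {ais:.3f}")
```

A real profiling run would repeat steps 1–3 once per ability and per benchmark, yielding the per-benchmark ability decomposition the paper reports.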