🤖 AI Summary
Current LLM evaluation relies on benchmark scores whose task labels (e.g., “reasoning”, “commonsense”) often misalign with the actual cognitive capabilities required, leading to ambiguous capability attribution.
Method: We propose the first interpretable diagnostic framework that decomposes performance across 10 mainstream benchmarks into contributions from 10 fine-grained cognitive abilities. The approach integrates gradient-based importance scoring with targeted parameter ablation to define an Ability Impact Score (AIS) that quantifies each ability’s causal contribution to model performance.
Contribution/Results: Experiments reveal that most benchmarks rely on synergistic multi-ability engagement—not isolated skills—and that datasets sharing identical high-level labels exhibit markedly divergent ability compositions. Notably, code generation tasks benefit substantially from holistic capability enhancement. The framework establishes a new, auditable, and attributable paradigm for LLM capability analysis, enabling precise, mechanism-aware evaluation beyond aggregate benchmark scores.
📝 Abstract
Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability because they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify whether these benchmarks actually measure what their labels claim. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model's success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task can negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
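The AIS pipeline described above (gradient-based importance scoring, then targeted parameter ablation, then measuring the resulting score drop) can be sketched in miniature. This is a hedged toy illustration only: it substitutes a small linear model for an LLM, and the probe data, scoring function, and top-25% ablation fraction are all assumptions for demonstration, not details from the paper.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for an LLM: a single linear layer.
model = torch.nn.Linear(8, 1)
loss_fn = torch.nn.MSELoss()

# Hypothetical "ability probe" set: gradients on this data define which
# parameters matter for the ability under study.
x_ability = torch.randn(32, 8)
y_ability = x_ability.sum(dim=1, keepdim=True)

# Step 1: gradient-based importance score per parameter, |grad * weight|.
loss = loss_fn(model(x_ability), y_ability)
loss.backward()
importance = (model.weight.grad * model.weight).detach().abs().flatten()

# Hypothetical "benchmark" evaluation: higher score = lower loss.
x_bench = torch.randn(64, 8)
y_bench = x_bench.sum(dim=1, keepdim=True)

def benchmark_score(m):
    with torch.no_grad():
        return -loss_fn(m(x_bench), y_bench).item()

base_score = benchmark_score(model)

# Step 2: targeted ablation — zero out the top-25% most important
# parameters for this ability (the fraction is an illustrative choice).
k = importance.numel() // 4
top_idx = importance.topk(k).indices
with torch.no_grad():
    model.weight.data.view(-1)[top_idx] = 0.0

# Step 3: AIS = drop in benchmark score caused by removing the
# ability-linked parameters.
ais = base_score - benchmark_score(model)
print(f"AIS = {ais:.3f}")
```

A real profiling run would repeat steps 1–3 once per ability and per benchmark, yielding the per-benchmark ability decomposition the paper reports.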