🤖 AI Summary
Current LLM evaluation relies on ill-defined notions of “intelligence” (e.g., the ARC and Raven-style benchmarks), which lack conceptual clarity and fail to reliably predict model performance on practical downstream tasks such as question answering, summarization, and code generation. The result is a misalignment between evaluation metrics and real-world utility.
Method: The authors propose replacing “intelligence” with “generality” as the core evaluation paradigm, formalizing it via a multi-task learning framework that quantifies both breadth (task diversity coverage) and stability (performance consistency) across heterogeneous tasks. This framework integrates conceptual analysis and theoretical modeling to enable measurable, reproducible generality assessment.
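To make the breadth/stability idea concrete, here is a minimal sketch (not taken from the paper) of how a generality score could be computed from normalized per-task scores. The task names, the competence threshold, and the particular breadth and stability formulas are illustrative assumptions, not the authors' formalization.

```python
import statistics

def generality_score(task_scores: dict[str, float], threshold: float = 0.5) -> dict[str, float]:
    """Toy generality metric over per-task scores normalized to [0, 1].

    breadth:   fraction of tasks on which the model clears a competence threshold
               (a crude proxy for task-diversity coverage).
    stability: 1 minus the dispersion of scores across tasks
               (1.0 when performance is identical on every task).
    """
    scores = list(task_scores.values())
    breadth = sum(s >= threshold for s in scores) / len(scores)
    stability = 1.0 - statistics.pstdev(scores)  # pstdev is 0 when all scores are equal
    return {"breadth": breadth, "stability": stability}

# Hypothetical normalized scores on heterogeneous tasks
scores = {"qa": 0.82, "summarization": 0.74, "code_gen": 0.61, "arc": 0.35}
print(generality_score(scores))
```

Any multi-task formalization along these lines rewards models that are both broadly competent and consistently so, rather than models that spike on a single abstract-reasoning benchmark.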
Contribution/Results: Empirical validation demonstrates that generality metrics consistently outperform traditional intelligence-based benchmarks in predicting real-world task performance. The proposed paradigm offers a more robust, application-oriented, and theoretically grounded metric for assessing AI progress.
📝 Abstract
Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive, lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.