On the Measure of a Model: From Intelligence to Generality

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluation relies on ill-defined notions of “intelligence” (e.g., ARC and Raven-inspired benchmarks) that lack conceptual clarity and fail to reliably predict model performance on practical downstream tasks such as question answering, summarization, and code generation, leaving evaluation metrics misaligned with real-world utility. Method: The authors propose replacing “intelligence” with “generality” as the core evaluation paradigm, framing generality as a multitask learning problem that quantifies both breadth (task-diversity coverage) and stability (performance consistency) across heterogeneous tasks, tying assessment to measurable, reproducible quantities. Contribution: Through conceptual and formal analysis of three assumptions underpinning intelligence-focused evaluation (generality, stability, and realism), the authors show that only generality withstands scrutiny, and argue that generality provides a more robust, application-oriented, and theoretically grounded foundation for assessing AI progress across diverse and evolving tasks.
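
A minimal formal sketch of that breadth/stability decomposition. The notation here is assumed for illustration (the task distribution 𝒯, score function s, and threshold τ are not the paper's):

```latex
% Illustrative formalization; \mathcal{T}, s, and \tau are assumptions.
% A model m is scored s(m, t) \in [0, 1] on each task t drawn from a
% heterogeneous task distribution \mathcal{T}.
\[
  \mathrm{breadth}(m)   = \Pr_{t \sim \mathcal{T}}\!\bigl[\, s(m, t) \ge \tau \,\bigr],
  \qquad
  \mathrm{stability}(m) = 1 - \sqrt{\operatorname{Var}_{t \sim \mathcal{T}}\!\bigl[\, s(m, t) \,\bigr]}.
\]
```

Read this way, breadth rewards coverage of many task families, while stability penalizes models whose performance swings widely from task to task.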

📝 Abstract
Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive, lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM “intelligence” lacks a stable definition and practical predictive utility
Optimizing for intelligence benchmarks risks misalignment with real-world performance
Proposing generality as a foundation for assessing capability across diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shifting evaluation from intelligence to generality
Framing generality as a multitask learning problem
Linking evaluation to measurable performance breadth and reliability (see the sketch after this list)
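
A minimal runnable sketch of the breadth/stability aggregation formalized above, over a toy multitask evaluation. The 0.5 threshold and the breadth-times-stability aggregation are illustrative assumptions, not the authors' metric:

```python
from statistics import mean, pstdev

def generality_score(task_scores: dict[str, float]) -> dict[str, float]:
    """Aggregate per-task scores (each in [0, 1]) into breadth,
    stability, and a combined generality number.

    Hypothetical formulation: the threshold and the product
    aggregation are assumptions for illustration only.
    """
    scores = list(task_scores.values())
    # Breadth: fraction of tasks where the model clears a minimal bar.
    breadth = mean(s >= 0.5 for s in scores)
    # Stability: 1 minus the population std dev of scores, so uniform
    # performance across tasks scores close to 1.
    stability = 1.0 - pstdev(scores)
    return {
        "breadth": breadth,
        "stability": stability,
        "generality": breadth * stability,
    }

if __name__ == "__main__":
    # Toy evaluation over heterogeneous tasks.
    print(generality_score({
        "question_answering": 0.82,
        "summarization": 0.74,
        "code_generation": 0.61,
        "arc_style_reasoning": 0.35,
    }))
```

Under this reading, a model with moderate but uniform scores can out-rank one with a few spectacular results and many failures, which is exactly the reliability property the generality framing is meant to capture.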