🤖 AI Summary
Current LLM evaluation relies on ill-defined notions of “intelligence” (e.g., the ARC and Raven-style benchmarks), which lack conceptual clarity and fail to reliably predict model performance on practical downstream tasks such as question answering, summarization, and code generation. The result is a misalignment between evaluation metrics and real-world utility.
Method: The authors propose replacing “intelligence” with “generality” as the core evaluation paradigm, formalizing it via a multi-task learning framework that quantifies both breadth (task diversity coverage) and stability (performance consistency) across heterogeneous tasks. This framework integrates conceptual analysis and theoretical modeling to enable measurable, reproducible generality assessment.
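To make the breadth/stability idea concrete, here is a minimal sketch (not taken from the paper) of how a generality score could be computed from normalized per-task scores. The task names, the competence threshold, and the particular breadth and stability formulas are illustrative assumptions, not the authors' formalization.

```python
import statistics

def generality_score(task_scores: dict[str, float], threshold: float = 0.5) -> dict[str, float]:
    """Toy generality metric over per-task scores normalized to [0, 1].

    breadth:   fraction of tasks on which the model clears a competence threshold
               (a crude proxy for task-diversity coverage).
    stability: 1 minus the dispersion of scores across tasks
               (1.0 when performance is identical on every task).
    """
    scores = list(task_scores.values())
    breadth = sum(s >= threshold for s in scores) / len(scores)
    stability = 1.0 - statistics.pstdev(scores)  # pstdev is 0 when all scores are equal
    return {"breadth": breadth, "stability": stability}

# Hypothetical normalized scores on heterogeneous tasks
scores = {"qa": 0.82, "summarization": 0.74, "code_gen": 0.61, "arc": 0.35}
print(generality_score(scores))
```

Any multi-task formalization along these lines rewards models that are both broadly competent and consistently so, rather than models that spike on a single abstract-reasoning benchmark.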
Contribution/Results: Empirical validation demonstrates that generality metrics consistently outperform traditional intelligence-based benchmarks in predicting real-world task performance. The proposed paradigm offers a more robust, application-oriented, and theoretically grounded metric for assessing AI progress.
📝 Abstract
Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive, lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.