🤖 AI Summary
Current AI capability evaluation lacks a formal measurement foundation, suffers from incomparability across systems and evaluation methods, and remains disconnected from the quantitative risk analysis used in engineering safety. To address these gaps, this paper outlines a hierarchical measurement-theory framework for AI that rigorously distinguishes direct from indirect observables and formally characterizes how definitions of AI capability depend on the chosen measurement operations and scales. Methodologically, the framework integrates classical measurement theory, formal modeling, and quantitative risk analysis, drawing on established paradigms from engineering and safety science. Its core contribution is a pathway toward the first systematic, calibratable, and traceable taxonomy of AI phenomena and capability representations. This would enable standardized, reproducible AI system evaluation, improving the reliability and interoperability of assessment outcomes across scientific validation, engineering deployment, and regulatory decision-making.
📝 Abstract
We motivate and outline a programme for a formal theory of measurement of artificial intelligence. We argue that formalising measurement for AI will allow researchers, practitioners, and regulators to: (i) make principled comparisons both across systems and across the evaluation methods applied to them; (ii) connect frontier AI evaluations with established quantitative risk analysis techniques drawn from engineering and safety science; and (iii) foreground that what counts as AI capability is contingent upon the measurement operations and scales we elect to use. We sketch a layered measurement stack, distinguish direct from indirect observables, and signpost how these ingredients provide a pathway toward a unified, calibratable taxonomy of AI phenomena.
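To make the direct/indirect distinction and the scale-dependence point concrete, the sketch below models a minimal slice of such a measurement stack in Python. It is illustrative only and not drawn from the paper; all names (`DirectObservation`, `CapabilityEstimate`, `estimate_capability`) and the choice of a mean-success-rate scale are assumptions made for exposition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Direct observable: a raw outcome recorded by a concrete measurement
# operation, e.g. pass/fail on one benchmark item under a fixed harness.
@dataclass(frozen=True)
class DirectObservation:
    item_id: str
    outcome: float  # 1.0 = success, 0.0 = failure

# Indirect observable: a capability value constructed from direct
# observations via an explicit, named scale. It is meaningful only
# relative to that scale and estimator.
@dataclass(frozen=True)
class CapabilityEstimate:
    scale: str    # name of the scale the estimate lives on
    value: float
    n: int        # number of underlying observations, kept for traceability

def estimate_capability(
    observations: list[DirectObservation],
    scale: str = "mean-success-rate",
    aggregate: Callable[[list[float]], float] = mean,
) -> CapabilityEstimate:
    """Map direct observables to an indirect one under a named scale."""
    outcomes = [obs.outcome for obs in observations]
    return CapabilityEstimate(scale=scale, value=aggregate(outcomes), n=len(outcomes))

if __name__ == "__main__":
    # Thirty hypothetical benchmark items; every third item fails.
    run = [DirectObservation(f"item-{i}", float(i % 3 != 0)) for i in range(30)]
    est = estimate_capability(run)
    print(f"{est.scale}: {est.value:.2f} (n={est.n})")  # mean-success-rate: 0.67 (n=30)
```

Swapping the aggregator or the named scale (say, a rank-based ordinal scale instead of a ratio-scaled success rate) would yield a different "capability" from the same raw observations, which is precisely the contingency that point (iii) of the abstract highlights.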