Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of current AI system evaluations: inconsistent methodologies and metrics yield results that are hard to compare and poorly aligned with real-world contexts and human needs. To narrow this gap, the authors propose a repeatable scenario generation process built on human-centered design, operational grounding, and methodological transparency. The approach first elicits AI use cases from subject matter experts via a structured AI Use Case Worksheet, then uses a three-stage pipeline that combines large language model prompting with iterative human review to expand those use cases into human-oriented evaluation scenarios. A validation rubric is described for assessing scenario quality. Applied to the U.S. financial services sector, the process yielded six illustrative high-level AI use cases and 107 evaluation scenarios, supporting more consistent, comparable, and operationally grounded AI evaluations.

📝 Abstract

AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases into detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, users (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate the utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These use cases illustrate the process and are not exhaustive. Central to our work is a three-stage expansion pipeline that combines LLM prompting with human reviews to generate 107 scenarios from the use cases elicited from SMEs. The process integrates iterative human reviews at every stage to ensure operational grounding: for scenario titles and descriptions; for core scenario elements such as users, benefits and risks, and metrics; and for scenario narratives and evaluation objectives. Human checkpoints ensure that scenarios remain reflective of real-world usage and human needs. We describe a validation rubric to assess scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.
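To make the worksheet and the three-stage expansion concrete, here is a minimal Python sketch. It is not the authors' implementation. The class names, field names, stage functions, and the sample worksheet values are all hypothetical, and simple template functions stand in for the LLM prompting step. It shows only the shape of the process described in the abstract: six worksheet elements go in, each stage fills part of a scenario, and a human review gate follows every stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class UseCaseWorksheet:
    """The six worksheet elements from the abstract (field names are assumed)."""
    use_case: str
    sector: str
    direct_users: list[str]
    indirect_users: list[str]
    intended_outcomes: list[str]
    positive_impacts: list[str]
    negative_impacts: list[str]
    kpis_and_metrics: list[str]


@dataclass
class Scenario:
    title: str = ""                                   # stage 1
    description: str = ""                             # stage 1
    users: list[str] = field(default_factory=list)    # stage 2
    benefits_and_risks: list[str] = field(default_factory=list)  # stage 2
    metrics: list[str] = field(default_factory=list)  # stage 2
    narrative: str = ""                               # stage 3
    evaluation_objectives: list[str] = field(default_factory=list)  # stage 3
    reviews_passed: int = 0


Stage = Callable[[UseCaseWorksheet, Scenario], Scenario]
Reviewer = Callable[[Scenario], bool]


# Placeholder stages: a real pipeline would prompt an LLM here (assumption).
def stage_titles(ws: UseCaseWorksheet, s: Scenario) -> Scenario:
    s.title = f"{ws.use_case} ({ws.sector})"
    s.description = f"Evaluate AI support for {ws.use_case}."
    return s


def stage_core_elements(ws: UseCaseWorksheet, s: Scenario) -> Scenario:
    s.users = ws.direct_users + ws.indirect_users
    s.benefits_and_risks = ws.positive_impacts + ws.negative_impacts
    s.metrics = list(ws.kpis_and_metrics)
    return s


def stage_narrative(ws: UseCaseWorksheet, s: Scenario) -> Scenario:
    actor = s.users[0] if s.users else "A user"
    s.narrative = f"{actor} uses an AI system for {ws.use_case}."
    s.evaluation_objectives = [f"Measure: {m}" for m in s.metrics]
    return s


def expand(ws: UseCaseWorksheet, stages: list[Stage],
           review: Reviewer) -> Optional[Scenario]:
    """Run each stage, then a human checkpoint; stop at the first rejection."""
    scenario = Scenario()
    for stage in stages:
        scenario = stage(ws, scenario)
        if not review(scenario):
            return None  # rejected by the reviewer at this checkpoint
        scenario.reviews_passed += 1
    return scenario


# Illustrative worksheet; every value except the use case and sector is invented.
example = UseCaseWorksheet(
    use_case="suspicious activity report (SAR) filing",
    sector="financial services",
    direct_users=["compliance analyst"],
    indirect_users=["regulator"],
    intended_outcomes=["faster report drafting"],
    positive_impacts=["reduced manual effort"],
    negative_impacts=["missed suspicious activity"],
    kpis_and_metrics=["time to file", "report accuracy"],
)
STAGES = [stage_titles, stage_core_elements, stage_narrative]
```

Calling `expand(example, STAGES, review=lambda s: True)` returns a fully populated scenario that has passed three checkpoints, while a reviewer that rejects at any stage yields `None`. Modeling review as a callable keeps the human checkpoint explicit at every stage rather than as a single final pass, which mirrors the abstract's emphasis on review "at every juncture".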

Problem

Research questions and friction points this paper is trying to address.

AI evaluation

apples-to-apples comparison

evaluation scenarios

human-centered design

methodological transparency

Innovation

Methods, ideas, or system contributions that make the work stand out.

evaluation scenarios

human-centered design

structured use case elicitation