Evaluations at Work: Measuring the Capabilities of GenAI in Use

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI benchmarks predominantly employ single-turn, static evaluations, failing to capture the multi-turn, dynamic nature of human-AI collaboration. Method: This paper proposes a task-oriented, multi-turn human-AI collaboration evaluation framework, using financial valuation as a representative domain. It develops a suite of metrics (a composite usage score built from semantic similarity, lexical overlap, and numerical fidelity; structural coherence; intra-turn diversity) alongside a novel "information frontier" measure of how closely AI outputs align with users' working knowledge, and finds that excessive pursuit of novelty degrades task performance. The methodology combines subtask dependency modeling, composite multidimensional metrics, dialogue structure analysis, and quantified knowledge distance estimation. Contribution/Results: Empirical evaluation shows that while greater integration of LLM-generated content improves output quality, the benefit is significantly reduced by response incoherence, subtask fragmentation, and large knowledge distance. The framework delivers an interpretable, intervention-aware optimization pathway for AI-augmented workflows.
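
To make the composite usage idea concrete, here is a minimal sketch that blends semantic similarity, lexical overlap, and numerical fidelity into a single score. The embedding model (all-MiniLM-L6-v2), the component definitions, and the weights are illustrative assumptions; the paper does not publish this exact formula.

```python
# Illustrative composite "usage" score: semantic similarity + lexical overlap +
# numerical fidelity. Weights, model choice, and component definitions are
# placeholder assumptions, not the paper's published metric.
import re
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def lexical_overlap(ai_text: str, user_text: str) -> float:
    """Jaccard overlap of lowercase word tokens (assumed proxy for word overlap)."""
    a, b = set(ai_text.lower().split()), set(user_text.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def numerical_fidelity(ai_text: str, user_text: str) -> float:
    """Fraction of numbers in the AI output that reappear in the user's output."""
    nums_ai = set(re.findall(r"-?\d+(?:\.\d+)?", ai_text))
    nums_user = set(re.findall(r"-?\d+(?:\.\d+)?", user_text))
    return len(nums_ai & nums_user) / len(nums_ai) if nums_ai else 0.0

def semantic_similarity(ai_text: str, user_text: str) -> float:
    """Cosine similarity of sentence embeddings, rescaled from [-1, 1] to [0, 1]."""
    emb = _model.encode([ai_text, user_text], convert_to_tensor=True)
    return (float(util.cos_sim(emb[0], emb[1])) + 1.0) / 2.0

def composite_usage(ai_text: str, user_text: str,
                    w_sem: float = 0.5, w_lex: float = 0.3, w_num: float = 0.2) -> float:
    """Weighted blend of the three components (weights are placeholders)."""
    return (w_sem * semantic_similarity(ai_text, user_text)
            + w_lex * lexical_overlap(ai_text, user_text)
            + w_num * numerical_fidelity(ai_text, user_text))
```

In practice such weights would need to be tuned or validated against human judgments of how much AI-generated content actually made it into the user's output.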

📝 Abstract
Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration. We present an evaluation framework that decomposes real-world tasks into interdependent subtasks, letting us track both LLM performance and users' strategies across a dialogue. Complementing this framework, we develop a suite of metrics, including a composite usage metric derived from semantic similarity, word overlap, and numerical matches; structural coherence; intra-turn diversity; and a novel measure of the "information frontier" reflecting the alignment between AI outputs and users' working knowledge. We demonstrate our methodology in a financial valuation task that mirrors real-world complexity. Our empirical findings reveal that while greater integration of LLM-generated content generally enhances output quality, its benefits are moderated by factors such as response incoherence, excessive subtask diversity, and the distance of provided information from users' existing knowledge. These results suggest that proactive dialogue strategies designed to inject novelty may inadvertently undermine task performance. Our work thus advances a more holistic evaluation of human-AI collaboration, offering both a robust methodological framework and actionable insights for developing more effective AI-augmented work processes.
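
To illustrate the "information frontier" notion, the following is a minimal sketch, assuming the user's working knowledge is represented by a small corpus of documents and that distance is taken in embedding space; the model name and the nearest-neighbor formulation are assumptions, not the paper's definition.

```python
# Illustrative "information frontier"-style distance: how far an AI response sits
# from a user's working knowledge, approximated as cosine distance to the nearest
# document in the user's prior-knowledge corpus. Representation and distance choice
# are assumptions, not the paper's definition.
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

def knowledge_distance(ai_response: str, user_knowledge_docs: list[str]) -> float:
    """Cosine distance from the AI response to the closest user-knowledge document."""
    resp = _model.encode([ai_response], normalize_embeddings=True)
    know = _model.encode(user_knowledge_docs, normalize_embeddings=True)
    sims = know @ resp.T                 # cosine similarities (vectors are normalized)
    return float(1.0 - sims.max())       # 0 = fully familiar, larger = more novel
```

Under this reading, the paper's finding would correspond to task performance degrading once this distance grows too large, even though some novelty is useful.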
Problem

Research questions and friction points this paper is trying to address.

Current benchmarks ignore the messy, multi-turn dynamics of human-AI collaboration
How to measure both LLM performance and users' strategies across a dialogue
How the alignment of AI outputs with users' knowledge affects task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposing tasks into interdependent subtasks for evaluation (see the sketch after this list)
Composite metrics spanning semantic similarity, lexical overlap, numerical fidelity, and coherence
Measuring AI-user alignment via a novel information-frontier distance
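
A minimal sketch of what subtask dependency modeling and intra-turn diversity might look like, assuming the valuation task is decomposed into a hypothetical DAG of subtasks; the concrete decomposition and scoring used in the paper are not reproduced here.

```python
# Assumed structure (not the paper's implementation): a task as a DAG of
# interdependent subtasks, with per-turn diversity and dependency-order checks.
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    depends_on: list[str] = field(default_factory=list)

# Hypothetical decomposition of a financial valuation task.
SUBTASKS = {
    "revenue_forecast": Subtask("revenue_forecast"),
    "discount_rate": Subtask("discount_rate"),
    "fcf_projection": Subtask("fcf_projection", ["revenue_forecast"]),
    "terminal_value": Subtask("terminal_value", ["fcf_projection", "discount_rate"]),
    "dcf_valuation": Subtask("dcf_valuation", ["fcf_projection", "terminal_value"]),
}

def intra_turn_diversity(turn_subtasks: list[str]) -> float:
    """Share of all subtasks touched within a single turn (a crude diversity proxy)."""
    return len(set(turn_subtasks)) / len(SUBTASKS)

def dependency_violations(dialogue_turns: list[list[str]]) -> int:
    """Count subtask mentions whose prerequisites have not yet appeared in the dialogue."""
    seen: set[str] = set()
    violations = 0
    for turn in dialogue_turns:
        for name in turn:
            violations += sum(1 for d in SUBTASKS[name].depends_on if d not in seen)
            seen.add(name)
    return violations
```

Tracking which subtasks each turn touches is what lets fragmentation (many unrelated subtasks per turn) and out-of-order work be quantified alongside output quality.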