🤖 AI Summary
This study investigates the mechanisms behind human–LLM judgment alignment in collaborative evaluation, focusing on how task characteristics and AI assistance strategies shape how users construct and dynamically refine evaluation criteria and how they select evaluator models.
Method: We conducted a controlled human–AI interaction study with 15 ML practitioners, each completing 6 real-world tasks for a total of 131 evaluations, comparing the direct assessment and pairwise comparison paradigms, supported by multi-round LLM-assisted judgments and qualitative behavioral analysis.
Contribution/Results: We present empirical evidence that direct assessment increases user engagement and criterion–task alignment: users customized criteria to the task, dynamically adjusted their judgments, and adaptively switched evaluator models. Based on these findings, we propose design principles for front-end evaluation tools tailored to human–AI collaboration, advancing low-overhead, interpretable, and task-adaptive AI-assisted evaluation.
📝 Abstract
Evaluating large language model (LLM) outputs requires users to make critical judgments about which outputs are best across various configurations. This process is costly and time-consuming given the large amount of data involved. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, or assist human evaluators with detailed assessments; effective front-end tools are therefore critical to support this process. Two common approaches for using LLMs as evaluators are direct assessment and pairwise comparison. In our study with machine learning practitioners (n=15), each completing 6 tasks for a total of 131 evaluations, we explore how task-related factors and assessment strategies influence criteria refinement and user perceptions. Findings show that users performed more evaluations with direct assessment, making criteria task-specific, modifying judgments, and changing the evaluator model. We conclude with recommendations for how systems can better support interactions in LLM-assisted evaluations.
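To make the distinction concrete, below is a minimal sketch of how the two judging paradigms differ when prompting an LLM. This is not the system used in the study: `call_llm`, the function names, the 1-to-5 scale, and the prompt wording are all illustrative assumptions.

```python
# Minimal sketch of the two LLM-as-judge paradigms (illustrative only).
# `call_llm` is a hypothetical stand-in for any chat-completion client.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its reply."""
    raise NotImplementedError


def judge_direct(task: str, output: str, criteria: list[str]) -> str:
    """Direct assessment: score a single output against user-defined criteria."""
    prompt = (
        f"Task: {task}\n"
        f"Output: {output}\n"
        f"Criteria: {'; '.join(criteria)}\n"
        "Rate the output on each criterion from 1 (poor) to 5 (excellent) "
        "and briefly justify each score."
    )
    return call_llm(prompt)


def judge_pairwise(task: str, output_a: str, output_b: str,
                   criteria: list[str]) -> str:
    """Pairwise comparison: ask which of two outputs better meets the criteria."""
    prompt = (
        f"Task: {task}\n"
        f"Output A: {output_a}\n"
        f"Output B: {output_b}\n"
        f"Criteria: {'; '.join(criteria)}\n"
        "Which output better satisfies the criteria? Answer 'A' or 'B' "
        "with a short justification."
    )
    return call_llm(prompt)
```

Note that direct assessment keeps the user's criteria front and center for a single output, which is consistent with the finding that it encourages task-specific criteria refinement, whereas pairwise comparison reduces each judgment to a relative choice between two outputs.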