🤖 AI Summary
This study investigates the mechanisms behind human–LLM judgment alignment in collaborative evaluation, focusing on how task characteristics and AI assistance strategies shape how users construct and dynamically refine evaluation criteria and how they select evaluator models.
Method: We conducted a controlled human–AI interaction study with 15 ML practitioners, each completing 6 real-world tasks for a total of 131 evaluations, comparing the direct assessment and pairwise comparison paradigms, supported by multi-round LLM-assisted judgments and qualitative behavioral analysis.
Contribution/Results: We present empirical evidence that direct assessment increases user engagement and criterion–task alignment: users customized criteria to the task, dynamically adjusted their judgments, and adaptively switched evaluator models. Based on these findings, we propose design principles for front-end evaluation tools tailored to human–AI collaboration, advancing low-overhead, interpretable, and task-adaptive AI-assisted evaluation.
📝 Abstract
Evaluating large language model (LLM) outputs requires users to make critical judgments about which outputs are best across various configurations. This process is costly and time-consuming given the large amount of data involved. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, or assist human evaluators with detailed assessments; effective front-end tools are therefore critical to support this process. Two common approaches for using LLMs as evaluators are direct assessment and pairwise comparison. In our study with machine learning practitioners (n=15), each completing 6 tasks for a total of 131 evaluations, we explore how task-related factors and assessment strategies influence criteria refinement and user perceptions. Findings show that users performed more evaluations with direct assessment, making criteria task-specific, modifying judgments, and changing the evaluator model. We conclude with recommendations for how systems can better support interactions in LLM-assisted evaluations.
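To make the distinction concrete, below is a minimal sketch of how the two judging paradigms differ when prompting an LLM. This is not the system used in the study: `call_llm`, the function names, the 1-to-5 scale, and the prompt wording are all illustrative assumptions.

```python
# Minimal sketch of the two LLM-as-judge paradigms (illustrative only).
# `call_llm` is a hypothetical stand-in for any chat-completion client.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its reply."""
    raise NotImplementedError


def judge_direct(task: str, output: str, criteria: list[str]) -> str:
    """Direct assessment: score a single output against user-defined criteria."""
    prompt = (
        f"Task: {task}\n"
        f"Output: {output}\n"
        f"Criteria: {'; '.join(criteria)}\n"
        "Rate the output on each criterion from 1 (poor) to 5 (excellent) "
        "and briefly justify each score."
    )
    return call_llm(prompt)


def judge_pairwise(task: str, output_a: str, output_b: str,
                   criteria: list[str]) -> str:
    """Pairwise comparison: ask which of two outputs better meets the criteria."""
    prompt = (
        f"Task: {task}\n"
        f"Output A: {output_a}\n"
        f"Output B: {output_b}\n"
        f"Criteria: {'; '.join(criteria)}\n"
        "Which output better satisfies the criteria? Answer 'A' or 'B' "
        "with a short justification."
    )
    return call_llm(prompt)
```

Note that direct assessment keeps the user's criteria front and center for a single output, which is consistent with the finding that it encourages task-specific criteria refinement, whereas pairwise comparison reduces each judgment to a relative choice between two outputs.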