🤖 AI Summary
This work addresses the degradation of user experience in large language model (LLM) agents caused by suboptimal interaction patterns (excessive confirmation requests, opaque reasoning, misaligned pacing) and the lack of evaluation frameworks that account for interaction quality and user preference alignment. To bridge this gap, we propose the Interaction-as-a-Tool (IaaT) paradigm, which formalizes interactive behaviors as structured tool calls, and introduce Prefix, a configurable environment that jointly optimizes task performance and interaction experience. We define 31 user preferences across 14 attributes and, for the first time, treat user experience as a core evaluation metric alongside task accuracy. Using a composite LLM-as-a-Judge mechanism across seven dimensions, our experiments demonstrate that preference-aware agents improve user experience by 7.6% and preference alignment by 18.5%, with the evaluation framework exhibiting high inter-rater reliability (ICC > 0.79), internal consistency (α = 0.943), and strong correlation with human judgments (ρ = 0.52–0.78).
📝 Abstract
LLM-based agents can complete tasks correctly yet still frustrate users through poor interaction patterns, such as excessive confirmations, opaque reasoning, or misaligned pacing. Current benchmarks evaluate task accuracy but overlook how agents interact: whether they infer preferences from implicit cues, adapt dynamically, or maintain fine-grained interaction quality. We introduce Prefix, a configurable environment that evaluates both what agents accomplish and how they interact. Central to Prefix is the Interaction-as-a-Tool (IaaT) paradigm, which treats interaction behaviors as structured tool calls, unifying them with existing evaluation frameworks. We define 31 preference settings across 14 attributes and formalize user experience (UX) as a core metric alongside task accuracy. A composite LLM-as-a-Judge mechanism across seven UX dimensions achieves strong aggregate reliability (ICC > 0.79), high internal consistency (α = 0.943), and human correlation (ρ = 0.52–0.78). Preference-aware agents show a 7.6% average UX improvement and an 18.5% gain in preference alignment. Our work is openly accessible.
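To make the IaaT idea concrete, here is a minimal sketch of what "interaction behaviors as structured tool calls" could look like. The tool names (`ask_confirmation`, `explain_reasoning`, `report_progress`) and their schemas are illustrative assumptions for this sketch, not the paper's actual interface:

```python
import json
from dataclasses import dataclass

@dataclass
class ToolCall:
    """A structured tool call: an interaction behavior with typed arguments."""
    name: str
    arguments: dict

# Hypothetical registry of interaction behaviors exposed as tools,
# in the spirit of the IaaT paradigm (names and schemas assumed).
INTERACTION_TOOLS = {
    "ask_confirmation": {"question": str},
    "explain_reasoning": {"summary": str},
    "report_progress": {"step": int, "total": int},
}

def make_interaction_call(name: str, **kwargs) -> ToolCall:
    """Validate an interaction behavior against its schema and package it
    as a structured tool call, so it can be logged and evaluated like any
    other tool use."""
    schema = INTERACTION_TOOLS.get(name)
    if schema is None:
        raise ValueError(f"unknown interaction tool: {name}")
    for arg, typ in schema.items():
        if arg not in kwargs or not isinstance(kwargs[arg], typ):
            raise TypeError(f"{name} expects {arg}: {typ.__name__}")
    return ToolCall(name=name, arguments=kwargs)

call = make_interaction_call("ask_confirmation",
                             question="Delete all 3 drafts?")
print(json.dumps({"tool": call.name, **call.arguments}))
# {"tool": "ask_confirmation", "question": "Delete all 3 drafts?"}
```

Because each interaction is a first-class tool call rather than free-form text, an evaluator (such as the paper's LLM-as-a-Judge) can count, inspect, and score these behaviors with the same machinery used for task-level tool use.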