🤖 AI Summary
This paper addresses core challenges in proactive conversational coaching agents: initially undefined goals that evolve over multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. The authors design and empirically evaluate five multi-turn, mixed-initiative coaching agents with distinct conversational styles. Across 155 real-user dialogues, functional completeness emerges as the primary determinant of user experience, while stylistic elements unsupported by underlying functionality actively reduce acceptance. Critically, agreement among user feedback, health-expert assessments, and LLM-based evaluations is very low (Cohen's κ < 0.3), exposing structural misalignment in current evaluation practice. The work offers design and evaluation insights for building trustworthy conversational coaching systems and advancing human-centered NLP.
📝 Abstract
While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges: initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into the design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.
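The κ < 0.3 figure cited in the summary refers to Cohen's kappa, a chance-corrected agreement statistic between two raters. As a minimal illustration (the rating labels and values below are hypothetical, not the paper's data), kappa can be computed directly from its definition, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's marginal label frequencies:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # observed agreement: fraction of items both raters label identically
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # expected agreement if raters labeled independently at their marginal rates
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# hypothetical ratings of the same six conversations by a user and an expert
user   = ["good", "good", "bad", "good", "bad", "good"]
expert = ["good", "bad",  "bad", "bad",  "good", "good"]
print(round(cohen_kappa(user, expert), 3))  # → 0.0: no better than chance
```

A raw 50% agreement rate can still yield κ = 0 when it matches what chance alone predicts, which is why kappa (rather than raw agreement) is the standard way to report inter-rater reliability.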