Substance over Style: Evaluating Proactive Conversational Coaching Agents

📅 2025-03-25
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses two core challenges in proactive conversational coaching: goals that are initially ill-defined and evolve over the course of the dialogue, and evaluation criteria that mix subjective and objective components. The authors design and empirically evaluate five multi-turn, mixed-initiative coaching agents. Based on 155 real-user dialogues, they first show that functional completeness is the primary determinant of user experience, while stylistic elements unsupported by underlying functionality significantly reduce acceptance. Crucially, inter-rater agreement among user feedback, health-expert assessments, and large-language-model evaluations is very low (Cohen's κ < 0.3), exposing structural weaknesses in current evaluation paradigms. In response, the paper proposes a human-AI collaborative triadic evaluation framework that combines multi-agent system design, mixed-initiative interaction modeling, and first-person experiential data collection. The work establishes a reproducible benchmark and offers methodological insights for building trustworthy, evidence-based coaching systems.
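The headline agreement statistic above is Cohen's κ, which corrects raw agreement for chance. As a minimal, self-contained sketch (the labels below are hypothetical toy data, not the paper's), pairwise κ between two evaluator types can be computed like this:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from each
    rater's marginal label distribution.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginals.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-conversation quality labels from two evaluator types.
user_labels   = ["good", "good", "bad", "good", "bad", "good"]
expert_labels = ["bad", "good", "bad", "bad", "good", "good"]
print(f"kappa = {cohens_kappa(user_labels, expert_labels):.2f}")
```

With these toy labels, observed agreement is 0.5 but chance agreement is also 0.5, so κ = 0: a reminder that raw agreement rates can badly overstate how aligned two evaluation approaches really are.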

📝 Abstract
While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into the design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.
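The "mixed-initiative" setting the abstract describes is easiest to see in code. Purely as an illustration (the paper's agents are not reproduced here; `agent_reply` and the style prompt are hypothetical stand-ins for whatever LM backend is used), a minimal multi-turn loop in which either party can steer the conversation might look like:

```python
def agent_reply(history: list[dict], style: str) -> str:
    """Hypothetical stand-in for an LM call; a real system would
    condition on the full dialogue history and the style instruction."""
    return "What would you like to work toward this week?"

def coaching_session(style: str, max_turns: int = 10) -> list[dict]:
    """Mixed initiative: the agent opens each exchange (agent initiative),
    but any user turn can redefine the goal (user initiative), so the
    coaching goal stays negotiable across turns."""
    history = [{"role": "system",
                "content": f"You are a proactive health coach. Style: {style}."}]
    for _ in range(max_turns):
        reply = agent_reply(history, style)
        print(f"coach> {reply}")
        history.append({"role": "assistant", "content": reply})
        user_turn = input("you> ").strip()
        if user_turn.lower() in {"quit", "exit"}:
            break
        history.append({"role": "user", "content": user_turn})
    return history
```

Varying only the `style` string while holding the loop fixed is one simple way to obtain agents with distinct conversational styles but identical core functionality, which is the contrast the study's user ratings turn on.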
Problem

Research questions and friction points this paper aims to address.

Evaluating multi-turn coaching agents with evolving goals
Assessing subjective user feedback versus expert evaluations
Improving design of human-centered conversational NLP systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Five multi-turn coaching agents with distinct conversational styles
User study collecting first-person feedback on 155 conversations
Comparison of user, health-expert, and LM evaluations