🤖 AI Summary
This study addresses potential subgroup discrimination, particularly against users with low health literacy and high self-efficacy, in multi-turn LLM-based health coaching, where personalized decision-making may inadvertently widen disparities. To this end, we propose a subgroup-aware offline policy evaluation (OPE) framework. Methodologically, we design a factorized decision head that decouples tool selection from interaction style, build a lightweight simulator over hidden user archetypes, and add an early information-gain reward that speeds up identification of user traits. Rewards are typed, pairing objective tool efficacy with subjective satisfaction, so the framework jointly models multidimensional objectives. Experiments reveal that a uniform heavy-tool policy raises average value on the logs but degrades performance for vulnerable subgroups. In contrast, our approach substantially narrows subgroup performance gaps, reduces trait-identification latency, and improves both goal achievement rate and pass@3 accuracy, pointing to an evaluation-first paradigm for fair, interpretable LLM-driven health interventions.
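To make the factorized decision head and the early information-gain bonus concrete, here is a minimal sketch. It assumes a frozen generator that exposes a per-turn state embedding and an explicit belief over hidden user archetypes; the dimensions, head sizes, and bonus weight `beta` are illustrative assumptions, not the implementation evaluated in the paper.

```python
import math
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): four tools, three interaction styles,
# and a 768-dim state embedding from the frozen generator.
N_TOOLS, N_STYLES, D_STATE = 4, 3, 768

class FactorizedDecisionHead(nn.Module):
    """Two small heads on top of a frozen generator's state embedding:
    one picks the tool, the other picks the interaction style."""

    def __init__(self, d_state: int = D_STATE):
        super().__init__()
        self.tool_head = nn.Linear(d_state, N_TOOLS)    # tool selection
        self.style_head = nn.Linear(d_state, N_STYLES)  # interaction style

    def forward(self, state: torch.Tensor):
        # Factorization: tool and style get independent distributions,
        # so style preferences never force a different tool choice.
        tool_probs = torch.softmax(self.tool_head(state), dim=-1)
        style_probs = torch.softmax(self.style_head(state), dim=-1)
        return tool_probs, style_probs

def info_gain_bonus(belief_before, belief_after, beta: float = 0.1) -> float:
    """Early information-gain bonus: pay out the entropy drop in the
    belief over user archetypes, scaled by an assumed weight beta."""
    def entropy(p):
        return -sum(x * math.log(x) for x in p if x > 0)
    # Clipped at zero so the agent is never paid for becoming less certain.
    return beta * max(0.0, entropy(belief_before) - entropy(belief_after))
```

Clipping the bonus at zero and keeping `beta` small matches the "small early information-gain bonus" framing: it nudges the coach to ask trait-revealing questions early without letting probing dominate the typed reward.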
📝 Abstract
We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
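To illustrate the "always report per-archetype metrics" recommendation, the sketch below estimates a candidate policy's value separately for each archetype using self-normalized importance sampling, a standard OPE estimator chosen here for illustration; the abstract does not specify which estimator the pilot used. The record fields and the toy numbers are assumptions, not the pilot's data.

```python
from collections import defaultdict

def per_archetype_value(logs):
    """Self-normalized importance sampling (SNIPS), computed per archetype.

    logs: iterable of dicts with assumed keys 'archetype' (subgroup label),
    'p_behavior' (logging policy's probability of the logged action),
    'p_target' (candidate policy's probability of that action), and
    'reward' (the typed reward observed for the turn).
    """
    num = defaultdict(float)  # importance-weighted reward per archetype
    den = defaultdict(float)  # importance-weight mass per archetype
    for rec in logs:
        w = rec["p_target"] / rec["p_behavior"]  # importance weight
        num[rec["archetype"]] += w * rec["reward"]
        den[rec["archetype"]] += w
    # One value estimate per archetype, never just the overall mean,
    # so subgroup harms cannot hide inside a rising average.
    return {a: num[a] / den[a] for a in num if den[a] > 0}

# Toy, purely illustrative records: the overall average can rise while
# one subgroup's estimated value falls, the failure mode noted above.
toy_logs = [
    {"archetype": "lowHL_highSE", "p_behavior": 0.5, "p_target": 0.9, "reward": 0.2},
    {"archetype": "typical",      "p_behavior": 0.5, "p_target": 0.9, "reward": 0.8},
    {"archetype": "typical",      "p_behavior": 0.4, "p_target": 0.1, "reward": 0.4},
]
print(per_archetype_value(toy_logs))
```

Self-normalization keeps the per-subgroup estimates bounded when a few turns carry large importance weights, which matters at the pilot's scale of only 280 rated turns.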