Flipping the Dialogue: Training and Evaluating User Language Models

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models poorly simulate authentic multi-turn user behavior (informal phrasing, personal stylistic quirks, on-the-fly self-correction), which biases and distorts evaluations of assistant models. Method: the authors propose User Language Models (User LMs), post-trained on multi-turn dialogue data to capture realistic user interaction patterns, rather than naively prompting or inverting assistant models. Results: User LMs improve behavioral fidelity and simulation robustness, validated through both automated metrics and human assessment. When coding and mathematical-reasoning conversations are simulated with User LMs, GPT-4o's accuracy drops from 74.6% to 57.4%, exposing interaction weaknesses that less realistic simulators mask. The work establishes a purpose-built user-side simulation paradigm for dialogue evaluation, offering a more trustworthy way to assess the conversational capabilities of large models.

📝 Abstract
Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants -- optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often prompting an LLM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.
Problem

Research questions and friction points this paper is trying to address.

Developing user language models for realistic conversation simulation
Addressing limitations of assistant LMs as poor user simulators
Evaluating assistant performance degradation in human-like dialogues
Innovation

Methods, ideas, or system contributions that make the work stand out.

User LMs simulate human users in conversations
Purpose-built models replace assistant LMs as simulators
User LMs improve simulation robustness and human alignment
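The Innovation points above amount to a user-simulator-in-the-loop evaluation: a User LM generates each user turn from the conversation so far, and the assistant model replies. A minimal sketch of that loop is below; the function names and toy callables are illustrative stand-ins, not the paper's actual interface.

```python
def simulate_conversation(user_lm, assistant_lm, task, max_turns=5):
    """Alternate user-simulator and assistant turns until the user
    simulator signals completion (returns None) or max_turns is hit."""
    history = []  # list of (role, utterance) pairs
    for _ in range(max_turns):
        user_turn = user_lm(task, history)  # next user utterance, or None to stop
        if user_turn is None:
            break
        history.append(("user", user_turn))
        history.append(("assistant", assistant_lm(history)))
    return history

# Toy stand-ins: a "user" that reveals its request in two partial,
# self-correcting turns (the kind of behavior the paper targets),
# and an "assistant" that simply acknowledges the last user turn.
def toy_user(task, history):
    n_user_turns = sum(1 for role, _ in history if role == "user")
    parts = ["write a sort fn", "oh wait, make it descending"]
    return parts[n_user_turns] if n_user_turns < len(parts) else None

def toy_assistant(history):
    return f"ack: {history[-1][1]}"

transcript = simulate_conversation(toy_user, toy_assistant, task="sorting")
```

In an actual evaluation, `user_lm` would be the post-trained User LM, `assistant_lm` the model under test (e.g. GPT-4o), and the final transcript would be scored for task success, which is where the reported 74.6% to 57.4% drop is measured.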