TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current language models are predominantly trained and evaluated on single-turn dialogues, which neither captures nor improves their multi-turn conversational capabilities. To address this limitation, this work introduces TurnWiseEval, a benchmark that makes single-turn and multi-turn performance directly comparable, and TurnWiseData, a scalable pipeline for synthesizing multi-turn dialogue data. Post-training Olmo 3 with only 10,000 synthetically generated multi-turn dialogues yields a 12% gain on TurnWiseEval, substantially narrowing the gap between single-turn and multi-turn competence.
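
The TurnWiseData pipeline itself is not reproduced on this page, but the summary's description, turning tasks into synthetic multi-turn dialogues at scale, suggests a rollout loop like the following. This is a minimal sketch assuming an OpenAI-compatible chat client; the model name, prompt, and helper functions (`decompose`, `synthesize_dialogue`) are illustrative placeholders, not the paper's released implementation.

```python
# Minimal sketch of a TurnWiseData-style synthesis loop (illustrative only).
# The client, model name, prompt, and helper names are assumptions, not the
# paper's released pipeline.
from openai import OpenAI

client = OpenAI()          # assumes an OpenAI-compatible endpoint
GENERATOR = "gpt-4o-mini"  # placeholder generator model

DECOMPOSE_PROMPT = (
    "Split the following request into {n} smaller user turns that, asked in "
    "sequence, cover the same overall task. Return one turn per line.\n\n{task}"
)

def decompose(task: str, n_turns: int = 3) -> list[str]:
    """Break one single-turn task into a sequence of shorter user turns."""
    resp = client.chat.completions.create(
        model=GENERATOR,
        messages=[{"role": "user",
                   "content": DECOMPOSE_PROMPT.format(n=n_turns, task=task)}],
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    return [line.strip() for line in lines if line.strip()]

def synthesize_dialogue(task: str, n_turns: int = 3) -> list[dict]:
    """Roll out a multi-turn conversation by answering each turn in context."""
    messages: list[dict] = []
    for user_turn in decompose(task, n_turns):
        messages.append({"role": "user", "content": user_turn})
        resp = client.chat.completions.create(model=GENERATOR, messages=messages)
        messages.append({"role": "assistant",
                         "content": resp.choices[0].message.content})
    return messages  # one synthetic multi-turn dialogue, ready for post-training
```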

📝 Abstract
Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn-specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline, TurnWiseData, which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.
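
The abstract's pairwise protocol, comparing a model's answer in a multi-turn setting against its answer to an equivalent single-turn prompt, could be sketched as below. This is a hypothetical illustration assuming an OpenAI-compatible client and an LLM judge; the judge prompt, model names, and function names are assumptions, not TurnWiseEval's actual protocol.

```python
# Minimal sketch of a TurnWiseEval-style pairwise comparison (illustrative only).
# The judge prompt, model names, and function names are assumptions, not the
# benchmark's actual protocol.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint
JUDGE = "gpt-4o"   # placeholder judge model

def run_single_turn(model: str, task: str) -> str:
    """Ask the full task in one user message."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": task}]
    )
    return resp.choices[0].message.content

def run_multi_turn(model: str, turns: list[str]) -> str:
    """Ask the same task spread across several user turns, keeping context."""
    messages: list[dict] = []
    for turn in turns:
        messages.append({"role": "user", "content": turn})
        resp = client.chat.completions.create(model=model, messages=messages)
        messages.append({"role": "assistant",
                         "content": resp.choices[0].message.content})
    return messages[-1]["content"]  # final answer after all turns

def judge_pair(task: str, single_answer: str, multi_answer: str) -> str:
    """Pairwise verdict: which setting produced the better completion."""
    prompt = (
        f"Task: {task}\n\nAnswer A (single-turn): {single_answer}\n\n"
        f"Answer B (multi-turn): {multi_answer}\n\n"
        "Which answer completes the task better? Reply with exactly A, B, or TIE."
    )
    resp = client.chat.completions.create(
        model=JUDGE, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()
```

Aggregating these verdicts over many tasks into a win rate would quantify the multi-/single-turn gap the abstract describes.
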
Problem

Research questions and friction points this paper is trying to address.

multi-turn conversation
single-turn evaluation
language model capability gap
conversational AI
dialogue systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn dialogue
benchmarking
synthetic data generation
language model evaluation
post-training