Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current evaluations of large language models (LLMs) predominantly emphasize response accuracy while overlooking the models' capacity to anticipate subsequent user interactions. This work proposes generating the next user utterance as a novel probing task, offering the first systematic assessment of LLMs' interaction awareness. Through experiments on 11 open-source LLMs across five task categories, employing temperature sampling, controlled perturbations, and collaboration-oriented fine-tuning, we demonstrate a significant decoupling between task accuracy and interaction awareness. Notably, higher-temperature sampling raises genuine follow-up rates by up to 22%, and targeted fine-tuning further enhances this capability. Our approach introduces a new dimension to model evaluation, addressing a critical gap in existing benchmarks.
📝 Abstract
Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across $11$ open-weight LLMs (Qwen3.5, gpt-oss, GLM) and $5$ datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from $41\%$ ($0.8$B) to $96.8\%$ ($397$B-A$17$B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher-temperature sampling reveals that interaction awareness is latent, with follow-up rates reaching $22\%$. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.
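The probe described in the abstract amounts to formatting the conversation so that the model must continue under the user role. The sketch below is a minimal illustration of that idea, assuming a ChatML-style template; the helper name `build_user_turn_prompt` and the exact tags are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the user-turn generation probe: given a (user query,
# assistant response) context, re-open the *user* role so that a language
# model would generate the next user turn rather than another assistant turn.
# The ChatML-style tags here are an assumed chat format for illustration.

def build_user_turn_prompt(user_query: str, assistant_response: str) -> str:
    """Format the two-turn context and leave a new user turn open."""
    return (
        f"<|im_start|>user\n{user_query}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant_response}<|im_end|>\n"
        f"<|im_start|>user\n"  # left open: generation continues as the user
    )

prompt = build_user_turn_prompt(
    "What is 17 * 24?",
    "17 * 24 = 408.",
)
# A model sampled from this prompt (e.g. at higher temperature, as in the
# paper) would produce a candidate follow-up, which is then judged as a
# genuine, context-grounded follow-up or not.
print(prompt)
```

Under this framing, the follow-up rate reported in the abstract would be the fraction of sampled continuations judged to be grounded follow-ups to the preceding exchange.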
Problem

Research questions and friction points this paper is trying to address.

interaction awareness
user turn generation
language models
conversation modeling
assistant-only benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

user-turn generation
interaction awareness
language model evaluation
dialogue modeling
latent interaction capability