🤖 AI Summary
This paper identifies and systematically investigates a previously unrecognized preference bias in large language models (LLMs), termed "user-assistant bias": a systematic asymmetry in how strongly LLMs favor user versus assistant utterances in multi-turn dialogues. We formally define this bias and introduce UserAssist, a benchmark dataset of 8,000 multi-turn conversations. Through controlled fine-tuning experiments, we find that human preference alignment amplifies user bias, whereas training on chain-of-thought reasoning traces mitigates it. To address this, we propose a bidirectional bias control framework based on direct preference optimization (DPO). Evaluating 52 mainstream models (26 commercial, 26 open-weight), we observe varying levels of user bias in commercial models, significant user bias in instruction-tuned open-weight models, and markedly weaker bias in reasoning-focused models. Our DPO-based method enables precise, direction-specific bias calibration and demonstrates robust generalization to both in-domain and out-of-domain conversations.
📄 Abstract
Large language models (LLMs) can be biased toward relying on their own or the user's information in the chat history, leading to overly stubborn or overly agreeable behavior in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset, $\textbf{UserAssist}$, which we use to benchmark, understand, and manipulate the user-assistant bias in frontier LLMs. Leveraging $\textbf{UserAssist-test}$, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation of the open-weight models reveals significant user bias in instruction-tuned models and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipes contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on $\textbf{UserAssist-train}$, and that the adjustment generalizes well to both in-domain and out-of-domain conversations. Our results provide insight into how LLMs integrate information from different sources, as well as a viable way to detect and control model abnormalities.
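Since the bias-adjustment method rests on standard DPO, a minimal sketch of the per-pair DPO objective may help fix intuitions. This is not the paper's implementation: function and argument names are illustrative, and the direction of adjustment would come from which response (user-favoring or assistant-favoring) is labeled as "chosen" when building the preference pairs from UserAssist-train.

```python
import math

def dpo_loss(pi_logp_chosen: float, pi_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    pi_logp_*  : sequence log-probabilities under the policy being trained
    ref_logp_* : sequence log-probabilities under the frozen reference model
    beta       : strength of the implicit KL constraint to the reference

    Loss = -log sigmoid(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)])
    Minimizing it pushes the policy to prefer the "chosen" response relative
    to the reference model's preference.
    """
    margin = (pi_logp_chosen - ref_logp_chosen) - (pi_logp_rejected - ref_logp_rejected)
    # Numerically plain logistic; real implementations use log-sigmoid for stability.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference exactly, the margin is 0
# and the loss is log(2); a positive margin drives the loss below log(2).
```

Flipping which side of each pair counts as "chosen" would, under this objective, push the bias in the opposite direction, which is presumably how bidirectional control is achieved.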