The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Offline evaluation fails to capture how large language models (LLMs) behave in real-world personalized settings because it neglects the systematic influence of user interaction history on model outputs. Method: We conduct the first large-scale field study, comparing responses from 800 real ChatGPT and Gemini users who posed identical benchmark questions within their stateful conversations. Contribution/Results: We empirically demonstrate substantial response variability across users for the same query, indicating a serious misalignment between offline metrics and online performance. This reveals a structural impact of personalized context on LLM outputs and underscores the necessity of incorporating user history into evaluation. We propose a hybrid assessment framework that integrates offline benchmarking with online conversational log analysis. Our findings provide critical empirical evidence and methodological foundations for dynamic, user-aware evaluation paradigms that better reflect real-world deployment conditions.

📝 Abstract
Standard offline evaluations for language models -- a series of independent, stateless inferences made by models -- fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, posing the identical benchmark question to the same language model can produce markedly different responses depending on whether it is asked of a stateless system, within one user's chat session, or within a different user's chat session. In this work, we provide empirical evidence of this phenomenon by comparing offline evaluations to field evaluations, conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other provided questions to their chat interfaces.
Problem

Research questions and friction points this paper is trying to address.

Offline LLM evaluations fail to capture real-world personalization effects
Identical prompts produce different responses across user chat sessions
Standard evaluations miss how personalization alters model behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalization-aware evaluation methodology
Comparative offline versus field analysis
Real user chat session experiments
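
The core comparison — a stateless offline answer versus answers to the same question collected across different users' personalized sessions — can be sketched with a toy divergence metric. Everything below is illustrative (the function name, the sample data, and the metric itself are assumptions, not the paper's actual protocol):

```python
from collections import Counter

def response_variability(field_responses):
    """Fraction of session responses that differ from the modal (most
    common) response to the same benchmark question. 0.0 means every
    session returned the same answer; values near 1.0 indicate heavy
    personalization-driven divergence."""
    counts = Counter(field_responses)
    modal_count = counts.most_common(1)[0][1]
    return 1.0 - modal_count / len(field_responses)

# Hypothetical field data: one benchmark question asked in five
# different users' chat sessions, plus the stateless offline answer.
offline_answer = "Paris"
field = ["Paris", "Paris", "Paris, of course!", "Paris", "It's Paris"]

print(response_variability(field))  # → 0.4
# Mismatch rate against the offline (stateless) reference answer:
print(sum(r != offline_answer for r in field) / len(field))  # → 0.4
```

In practice a study like this would compare responses with semantic rather than exact-string matching, but exact matching keeps the sketch self-contained.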