🤖 AI Summary
Offline evaluation fails to capture the behavior of large language models (LLMs) in real-world personalized settings, because it neglects the systematic influence of user interaction history on model outputs. Method: We conduct the first large-scale field study, comparing responses collected from 800 real ChatGPT and Gemini users who posed identical benchmark questions within their own stateful conversations. Contribution/Results: We empirically demonstrate substantial response variability across users for the same query, indicating severe misalignment between offline metrics and online performance. This reveals a structural impact of personalized context on LLM outputs and underscores the necessity of incorporating user history into evaluation. We propose a hybrid assessment framework integrating offline benchmarking with online conversational log analysis. Our findings provide critical empirical evidence and methodological foundations for dynamic, user-aware evaluation paradigms that better reflect real-world deployment conditions.
📝 Abstract
Standard offline evaluations for language models -- a series of independent, stateless inferences -- fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, the same benchmark question posed to the same language model can produce markedly different responses depending on whether it is asked of a stateless system, within one user's chat session, or within a different user's chat session. In this work, we provide empirical evidence of this phenomenon by comparing offline evaluations to field evaluations in which 800 real users of ChatGPT and Gemini posed benchmark and other provided questions to their own chat interfaces.
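The comparison described above hinges on quantifying how much responses to the same query diverge across users' stateful sessions. The following is a minimal, hypothetical sketch (not the paper's actual methodology) of one way to score that divergence: the fraction of distinct normalized answers per question, where 0 means all users saw the same response and values near 1 mean nearly every user received a different one. All names and data here are illustrative assumptions.

```python
from collections import defaultdict

def response_variability(records):
    """Estimate per-question response variability across users.

    records: iterable of (question_id, user_id, answer) tuples, where each
    answer was collected inside that user's stateful chat session.
    Returns {question_id: fraction of distinct normalized answers}.
    """
    by_question = defaultdict(list)
    for question_id, _user_id, answer in records:
        # Crude normalization; a real study would need semantic matching.
        by_question[question_id].append(answer.strip().lower())
    return {
        qid: len(set(answers)) / len(answers)
        for qid, answers in by_question.items()
    }

# Toy field-study records (hypothetical, not the paper's data).
records = [
    ("q1", "u1", "Paris"), ("q1", "u2", "paris"), ("q1", "u3", "Paris"),
    ("q2", "u1", "42"), ("q2", "u2", "forty-two"), ("q2", "u3", "41"),
]
variability = response_variability(records)
# q1 collapses to a single normalized answer; q2 has three distinct ones.
```

Exact-string matching is only a stand-in: semantically equivalent answers phrased differently (as with "42" vs. "forty-two" above) would inflate this score, which is one reason a hybrid framework combining benchmarks with conversational log analysis is needed.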