🤖 AI Summary
Offline evaluation fails to capture the behavior of large language models (LLMs) in real-world personalized settings, because it neglects the systematic influence of user interaction history on model outputs. Method: We conduct the first large-scale field study, comparing responses collected from 800 real ChatGPT and Gemini users who posed identical benchmark questions within their own stateful conversations. Contribution/Results: We empirically demonstrate substantial response variability across users for the same query, indicating severe misalignment between offline metrics and online performance. This reveals a structural impact of personalized context on LLM outputs and underscores the necessity of incorporating user history into evaluation. We propose a hybrid assessment framework integrating offline benchmarking with online conversational log analysis. Our findings provide critical empirical evidence and methodological foundations for dynamic, user-aware evaluation paradigms that better reflect real-world deployment conditions.
📝 Abstract
Standard offline evaluations for language models -- a series of independent, stateless inferences -- fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, the same benchmark question posed to the same language model can produce markedly different responses depending on whether it is asked of a stateless system, within one user's chat session, or within a different user's chat session. In this work, we provide empirical evidence of this phenomenon by comparing offline evaluations to field evaluations in which 800 real users of ChatGPT and Gemini posed benchmark and other provided questions to their own chat interfaces.
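The comparison described above hinges on quantifying how much responses to the same query diverge across users' stateful sessions. The following is a minimal, hypothetical sketch (not the paper's actual methodology) of one way to score that divergence: the fraction of distinct normalized answers per question, where 0 means all users saw the same response and values near 1 mean nearly every user received a different one. All names and data here are illustrative assumptions.

```python
from collections import defaultdict

def response_variability(records):
    """Estimate per-question response variability across users.

    records: iterable of (question_id, user_id, answer) tuples, where each
    answer was collected inside that user's stateful chat session.
    Returns {question_id: fraction of distinct normalized answers}.
    """
    by_question = defaultdict(list)
    for question_id, _user_id, answer in records:
        # Crude normalization; a real study would need semantic matching.
        by_question[question_id].append(answer.strip().lower())
    return {
        qid: len(set(answers)) / len(answers)
        for qid, answers in by_question.items()
    }

# Toy field-study records (hypothetical, not the paper's data).
records = [
    ("q1", "u1", "Paris"), ("q1", "u2", "paris"), ("q1", "u3", "Paris"),
    ("q2", "u1", "42"), ("q2", "u2", "forty-two"), ("q2", "u3", "41"),
]
variability = response_variability(records)
# q1 collapses to a single normalized answer; q2 has three distinct ones.
```

Exact-string matching is only a stand-in: semantically equivalent answers phrased differently (as with "42" vs. "forty-two" above) would inflate this score, which is one reason a hybrid framework combining benchmarks with conversational log analysis is needed.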