The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the bias in offline evaluation of language models that arises when model choice in logged data is confounded: the user-side factors that determine which model is used also affect how its output is scored. The authors propose a three-source design that combines a large observational log (OBS), a small randomized experiment (EXP), and an offline simulator (SIM) that replays candidate models on cached contexts. Their identification theorem shows that EXP and SIM together are sufficient to recover causal model values; OBS is used only afterward, to reduce estimation error rather than to make the causal comparison valid. They evaluate six estimator families in a controlled semi-synthetic validation and on two real-task cached benchmarks (summarization and coding). No family dominates every regime: relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.

📝 Abstract

Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.
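The confounding problem described in the abstract can be illustrated with a small worked example. The sketch below is not the paper's method or data: the user-side factor, the score table, and the selection probabilities are all invented for illustration, and the names `P_U`, `MEAN_SCORE`, `P_CHOOSE`, `causal_value`, and `naive_logged_value` are hypothetical. It computes exact expectations for two models under a logging policy in which lenient raters mostly pick model B, and compares the naive logged average with the value a randomized assignment would target.

```python
# Toy illustration only: every number here is invented, not taken from the paper.

# Hypothetical binary user-side factor: 0 = strict rater, 1 = lenient rater.
P_U = {0: 0.5, 1: 0.5}

# Expected judged score for each (model, user factor) pair.
# Model A is better than model B by 0.1 for every type of user.
MEAN_SCORE = {
    ("A", 0): 0.6, ("A", 1): 0.8,
    ("B", 0): 0.5, ("B", 1): 0.7,
}

# Confounded logging policy: probability that a user with factor u picks each model.
# Lenient raters mostly choose B, strict raters mostly choose A.
P_CHOOSE = {
    0: {"A": 0.9, "B": 0.1},
    1: {"A": 0.1, "B": 0.9},
}


def causal_value(model):
    """Expected score if every user were assigned `model` (the randomized target)."""
    return sum(P_U[u] * MEAN_SCORE[(model, u)] for u in P_U)


def naive_logged_value(model):
    """Expected logged score among users who self-selected `model`."""
    joint = {u: P_U[u] * P_CHOOSE[u][model] for u in P_U}
    total = sum(joint.values())
    return sum(joint[u] / total * MEAN_SCORE[(model, u)] for u in P_U)


for m in ("A", "B"):
    print(m, round(causal_value(m), 3), round(naive_logged_value(m), 3))
```

Under these assumed numbers the causal values are about 0.70 for A and 0.60 for B, while the naive logged averages are about 0.62 for A and 0.68 for B, so the raw log ranks the models in the wrong order. Overriding model choice removes the dependence between the user factor and the model, which is why a randomized comparison targets the first pair of numbers.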

Problem

Research questions and friction points this paper is trying to address.

confounded model choice

offline evaluation

language model generation

causal inference

usage logs

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal evaluation

confounded model choice

three-source design