🤖 AI Summary
This work addresses the lack of standardized evaluation tools for user simulators in interactive information retrieval, where behavioral realism and tester reliability are often conflated. The authors present SimEval-IR, an open-source toolkit and benchmark suite that separates and quantifies these two objectives through three executable benchmarks. Using a canonical session schema that covers both session search and conversational interactions, validated dataset adapters, explicit loss accounting, and RATE-style estimators, they report baselines across four real datasets in two languages and four simulator families. Marginal click-depth distance and Fréchet distance over session embeddings correlate with system-ranking validity (|r| = 0.43 and 0.40, p ≤ 0.005), whereas the classifier-discriminator "human-likeness" check has essentially no pooled predictive power (r = +0.09, n = 48).
📝 Abstract
User simulators are increasingly central to interactive information retrieval, yet the community lacks standardized evaluation tools. Simulators serve two objectives, behavioral realism (matching real user behavior) and tester reliability (producing valid system rankings), and these are often conflated despite being distinct and sometimes conflicting. We present SimEval-IR, an open-source toolkit and benchmark suite that makes this distinction measurable. SimEval-IR provides: (1) a canonical session schema unifying session search and conversational interactions, with validated dataset adapters and explicit loss accounting; (2) three executable benchmarks covering behavioral realism, tester reliability with RATE-style estimation, and an analysis linking the two; and (3) baseline results across four real datasets in two languages and four simulator families. Our key finding: the classifier-discriminator "human-likeness" check, the dominant realism test in the literature, has essentially no pooled predictive power for system-ranking validity ($r = +0.09$, $n = 48$), while marginal click-depth distance and Fréchet distance over session embeddings give a much stronger signal ($|r| = 0.43$ and $0.40$, $p \leq 0.005$). SimEval-IR is released with all configurations and scripts to reproduce the reported analysis.
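The abstract names two distributional realism metrics and a correlation analysis but does not define them, so the following is only a minimal sketch under stated assumptions, not SimEval-IR's implementation. It assumes that click depth is the deepest clicked rank in a session, that the marginal click-depth distance is a total variation distance between the two empirical depth distributions, and that the Fréchet distance is computed between Gaussian fits of the session embeddings, simplified here to diagonal covariances. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import fmean, pvariance


def click_depth_distance(real_depths, sim_depths):
    """Total variation distance between two empirical click-depth distributions.

    Assumption: each value is the deepest clicked rank in one session, and
    total variation is the distance; the toolkit may define both differently.
    """
    real, sim = Counter(real_depths), Counter(sim_depths)
    n_real, n_sim = len(real_depths), len(sim_depths)
    support = set(real) | set(sim)
    # Counter returns 0 for depths that never occur on one side.
    return 0.5 * sum(abs(real[d] / n_real - sim[d] / n_sim) for d in support)


def frechet_distance_sq_diag(real_emb, sim_emb):
    """Squared Fréchet distance between Gaussian fits with diagonal covariance.

    With diagonal covariances the general formula
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))
    reduces to a per-dimension sum of squared mean gaps and squared
    standard-deviation gaps. A full-covariance version needs a matrix
    square root and is not shown here.
    """
    total = 0.0
    # zip(*rows) transposes a list of embedding vectors into per-dimension columns.
    for real_dim, sim_dim in zip(zip(*real_emb), zip(*sim_emb)):
        mean_gap = fmean(real_dim) - fmean(sim_dim)
        std_gap = sqrt(pvariance(real_dim)) - sqrt(pvariance(sim_dim))
        total += mean_gap ** 2 + std_gap ** 2
    return total


def pearson_r(xs, ys):
    """Pearson correlation between paired scores, e.g. one realism distance
    and one ranking-validity score per simulator configuration."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Under these assumptions, the linking analysis would compute one realism distance and one ranking-validity score for each simulator configuration, then pass the two paired lists to `pearson_r`. A distance that tracks tester reliability should show a clearly nonzero correlation, while an uninformative check would sit near zero.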