🤖 AI Summary
This work addresses the challenge of accurately evaluating user satisfaction in open-domain conversational AI, where traditional A/B testing is hindered by sparse explicit feedback and ambiguous implicit signals. To close this gap, the authors propose BoRP, a framework that leverages the geometric structure of large language model (LLM) latent spaces together with a polarization-index-guided bootstrapping mechanism that automatically generates evaluation criteria, without requiring generative inference. By mapping hidden states to continuous satisfaction scores via partial least squares (PLS) regression, BoRP enables full-scale monitoring and highly sensitive A/B testing. Experiments on industrial datasets show that BoRP significantly outperforms generative baselines, including Qwen3-Max, aligns strongly with human judgments, and cuts inference costs by several orders of magnitude.
📝 Abstract
Accurate evaluation of user satisfaction is critical for the iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of the LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and Partial Least Squares (PLS) regression to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (with Qwen3-8B/14B backbones) significantly outperforms generative baselines, even Qwen3-Max, in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
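The CUPED adjustment mentioned at the end is a standard variance-reduction technique: subtract from the experiment metric Y the portion explained by a pre-experiment covariate X, using theta = cov(X, Y) / var(X). A minimal sketch with synthetic data follows; the specific numbers and the choice of pre-period satisfaction as the covariate are assumptions for illustration, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pre-experiment satisfaction scores (covariate X) and in-experiment
# scores (metric Y) for the same users, correlated because user-level
# satisfaction tends to persist. All values are synthetic.
n = 10_000
x = rng.normal(loc=3.5, scale=1.0, size=n)            # pre-period score
y = 0.8 * x + rng.normal(loc=0.7, scale=0.5, size=n)  # experiment score

# CUPED: remove the part of Y predictable from the pre-period covariate.
# The adjustment is mean-preserving, so treatment-effect estimates are
# unchanged while their variance shrinks.
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

print(f"var(Y)       = {np.var(y):.3f}")
print(f"var(Y_cuped) = {np.var(y_cuped):.3f}")
```

The stronger the correlation between the covariate and the metric, the larger the variance reduction, which is why a continuous, full-traffic satisfaction score is a good CUPED covariate for A/B tests.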