🤖 AI Summary
This work addresses the challenge of accurately evaluating user satisfaction in open-domain conversational AI, where traditional A/B testing is hindered by sparse explicit feedback and ambiguous implicit signals. To close this gap, the authors propose BoRP, a framework that leverages the geometric structure of large language model (LLM) latent spaces together with a polarization-index-guided bootstrapping mechanism that automatically generates evaluation criteria, without requiring generative inference. By mapping hidden states to continuous satisfaction scores via partial least squares (PLS) regression, BoRP enables full-scale monitoring and highly sensitive A/B testing. Experiments on industrial datasets show that BoRP significantly outperforms generative baselines, including Qwen3-Max, aligns strongly with human judgments, and cuts inference costs by several orders of magnitude.
📝 Abstract
Accurate evaluation of user satisfaction is critical for the iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of the LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and Partial Least Squares (PLS) regression to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (with Qwen3-8B/14B backbones) significantly outperforms generative baselines, even Qwen3-Max, in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
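The CUPED adjustment mentioned at the end is a standard variance-reduction technique: subtract from the experiment metric Y the portion explained by a pre-experiment covariate X, using theta = cov(X, Y) / var(X). A minimal sketch with synthetic data follows; the specific numbers and the choice of pre-period satisfaction as the covariate are assumptions for illustration, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pre-experiment satisfaction scores (covariate X) and in-experiment
# scores (metric Y) for the same users, correlated because user-level
# satisfaction tends to persist. All values are synthetic.
n = 10_000
x = rng.normal(loc=3.5, scale=1.0, size=n)            # pre-period score
y = 0.8 * x + rng.normal(loc=0.7, scale=0.5, size=n)  # experiment score

# CUPED: remove the part of Y predictable from the pre-period covariate.
# The adjustment is mean-preserving, so treatment-effect estimates are
# unchanged while their variance shrinks.
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

print(f"var(Y)       = {np.var(y):.3f}")
print(f"var(Y_cuped) = {np.var(y_cuped):.3f}")
```

The stronger the correlation between the covariate and the metric, the larger the variance reduction, which is why a continuous, full-traffic satisfaction score is a good CUPED covariate for A/B tests.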