Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

2026-05-12

AI Summary
This study addresses selection bias in user feedback on deployed large language models: only a non-random, polarized fraction of users leave feedback, so a naive average over it can deviate from true system quality by 40–50 percentage points. The authors cast the problem as a topic- and sentiment-stratified Bayesian inference task that requires no ground-truth labels on individual interactions. A mild prior on the feedback channel resolves the otherwise non-identifiable parameters, and the pipeline also emits drift signals for online recalibration. The end-to-end three-agent pipeline combines UMAP + HDBSCAN topic clustering, a two-stage hierarchical Beta-Binomial bias model, and reweighting by topic prevalence. Evaluated on UltraFeedback with simulated selection bias, the informed-prior variant stays within 4–13 percentage points of true quality as the bias ratio sweeps from 1:1 to 30:1, and its 95% credible intervals cover the true value in all 50 random-seed replicates at the 10:1 setting. Weak-prior variants miss by 22–33 percentage points.
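To make the naive-average gap and the non-identifiability concrete, here is a minimal sketch under a simplified two-rate feedback model that is an assumption of this example, not the paper's model: a satisfied user leaves a thumbs-up with probability `s_pos`, and a dissatisfied user leaves a thumbs-down with probability `s_neg`. The function names and the values `s_pos = 0.01` and `q_alt = 0.30` are hypothetical. The quality value 0.6249 and the 10:1 ratio are borrowed from the figures quoted in the paper's abstract.

```python
def expected_thumb_rates(q, s_pos, s_neg):
    """Expected thumbs-up and thumbs-down counts per interaction.

    q     : true fraction of satisfied interactions
    s_pos : P(feedback | satisfied)     (assumed model, hypothetical name)
    s_neg : P(feedback | dissatisfied)  (assumed model, hypothetical name)
    """
    return q * s_pos, (1.0 - q) * s_neg


def naive_estimate(q, s_pos, s_neg):
    """Share of thumbs-up among all thumbs, i.e. the naive average."""
    up, down = expected_thumb_rates(q, s_pos, s_neg)
    return up / (up + down)


q_true = 0.6249          # true quality quoted in the abstract
s_pos = 0.01             # hypothetical positive-feedback rate
kappa = 10               # negative-to-positive feedback ratio
s_neg = kappa * s_pos

naive = naive_estimate(q_true, s_pos, s_neg)
print(f"true quality  = {q_true:.4f}")
print(f"naive average = {naive:.4f}")
print(f"gap (pp)      = {100 * (q_true - naive):.1f}")

# Non-identifiability: pick a different quality and solve for the rates
# that reproduce exactly the same expected thumb counts.
up, down = expected_thumb_rates(q_true, s_pos, s_neg)
q_alt = 0.30                              # hypothetical alternative quality
s_pos_alt = up / q_alt                    # compensating positive rate
s_neg_alt = down / (1.0 - q_alt)          # compensating negative rate
alt = expected_thumb_rates(q_alt, s_pos_alt, s_neg_alt)
print("same observables:", alt, "vs", (up, down))
```

Under this toy model the observed thumb counts pin down only two quantities, while there are three unknowns (`q`, `s_pos`, `s_neg`). That is why some outside information about the feedback rates is needed before quality can be recovered.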
πŸ“ Abstract
[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment-stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat\pi_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hat\pi_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback ($N=10{,}232$ retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
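The synthesis step described above, $\bar Q = \sum_c \hat\pi_c q_c$ with $\hat\pi_c = n_c/N$, can be sketched as a weighted sum applied to each posterior draw. This is an illustrative sketch rather than the authors' code. The function name `synthesize`, the three-cluster counts, and the Beta draws standing in for NUTS output are all assumptions made for the example.

```python
import numpy as np


def synthesize(q_draws, n_c, level=0.95):
    """Aggregate per-topic quality draws into a prevalence-weighted posterior.

    q_draws : array of shape (n_draws, C), posterior draws of q_c
    n_c     : array of shape (C,), number of interactions per topic
    Returns the posterior mean of Q-bar and an equal-tailed credible interval.
    """
    q_draws = np.asarray(q_draws, dtype=float)
    n_c = np.asarray(n_c, dtype=float)
    pi_hat = n_c / n_c.sum()              # topic prevalence, pi_c = n_c / N
    q_bar = q_draws @ pi_hat              # one aggregate value per draw
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(q_bar, [tail, 1.0 - tail])
    return q_bar.mean(), (lo, hi)


rng = np.random.default_rng(0)

# Hypothetical topic sizes (the paper reports C = 18 clusters; 3 are used here).
n_c = np.array([5000, 3000, 2000])

# Stand-in posterior draws: independent Beta samples per topic.
# A real pipeline would take these from the fitted hierarchical model.
beta_params = [(70, 30), (50, 50), (30, 70)]
q_draws = np.column_stack([rng.beta(a, b, size=4000) for a, b in beta_params])

mean, (lo, hi) = synthesize(q_draws, n_c)
print(f"Q-bar posterior mean: {mean:.3f}")
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

Applying the weights draw by draw, instead of to per-topic posterior means, carries the uncertainty in each $q_c$ through to the interval on $\bar Q$. The sketch treats $\hat\pi_c$ as a fixed point estimate, matching the formula in the abstract.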
Problem

Research questions and friction points this paper is trying to address.

selection bias
sparse user feedback
LLM quality estimation
sentiment stratification
topic stratification
Innovation

Methods, ideas, or system contributions that make the work stand out.

selection bias correction
hierarchical Bayesian modeling
topic-aware quality estimation
sparse user feedback
multi-agent pipeline