Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study addresses severe selection bias in user feedback on deployed large language models: feedback comes from a non-random, polarized subset of users, so a naive average over it can deviate from true system quality by 40–50 percentage points. The authors propose a three-agent hierarchical Bayesian pipeline that stratifies feedback by topic and sentiment and requires no ground-truth labels on individual interactions. A mild prior on the feedback channel resolves the parameter non-identifiability that otherwise leaves weak-prior variants 22–33 percentage points off, and the pipeline also emits drift signals for online recalibration. End to end, it combines UMAP + HDBSCAN clustering, a two-stage hierarchical Beta-Binomial bias model, and reweighting by topic prevalence. Evaluated on UltraFeedback with simulated selection bias, the informed-prior variant keeps estimation error within 4–13 percentage points as the bias ratio sweeps from 1:1 to 30:1, and its 95% credible intervals cover the true value in all 50 random-seed replicates at $\kappa_{\max}=10$, substantially outperforming baseline approaches.
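To make the size of the naive-average error concrete, the following is a minimal simulation sketch, not the authors' code. It assumes a single topic, a binary satisfied/dissatisfied outcome, and hypothetical feedback rates (`p_feedback_pos`, `p_feedback_neg`); the function names and all numeric values are illustrative assumptions rather than settings from the paper.

```python
import random


def expected_naive(true_quality, p_feedback_pos, p_feedback_neg):
    """Closed-form share of positive feedback under sentiment-dependent selection."""
    pos = true_quality * p_feedback_pos
    neg = (1.0 - true_quality) * p_feedback_neg
    return pos / (pos + neg)


def simulate_naive_estimate(true_quality, p_feedback_pos, p_feedback_neg,
                            n=200_000, seed=0):
    """Simulate n interactions and return (naive_estimate, n_feedback).

    Each user is satisfied with probability true_quality, then leaves
    feedback with a probability that depends on their sentiment.
    """
    rng = random.Random(seed)
    pos = neg = 0
    for _ in range(n):
        satisfied = rng.random() < true_quality
        p_feedback = p_feedback_pos if satisfied else p_feedback_neg
        if rng.random() < p_feedback:
            if satisfied:
                pos += 1
            else:
                neg += 1
    return pos / (pos + neg), pos + neg


if __name__ == "__main__":
    # Hypothetical setting: 60% true quality, dissatisfied users are
    # 10x more likely to leave feedback than satisfied users.
    q, p_pos, p_neg = 0.6, 0.01, 0.10
    naive, n_fb = simulate_naive_estimate(q, p_pos, p_neg)
    print(f"true quality      : {q:.3f}")
    print(f"naive estimate    : {naive:.3f} from {n_fb} feedback events")
    print(f"closed-form value : {expected_naive(q, p_pos, p_neg):.3f}")
```

Under these assumed rates the closed-form naive estimate is about 0.13 against a true quality of 0.60, a gap of roughly 47 percentage points, which is the same order of magnitude as the deviation the summary describes.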

📝 Abstract

[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs-up/down ratings sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment-stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat\pi_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hat\pi_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback ($N=10{,}232$ retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
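The synthesis step described in the abstract, reweighting per-topic quality by topic prevalence, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: `aggregate_quality` is a hypothetical helper, and the per-topic posterior draws are stand-ins generated from fixed Beta distributions rather than output from the NUTS fit.

```python
import random


def aggregate_quality(topic_counts, q_draws):
    """Combine per-topic quality draws into an aggregate posterior summary.

    topic_counts: interactions per topic, n_c (all traffic, not only feedback).
    q_draws: list of posterior draws; each draw is a list with one q_c per topic.
    Returns (mean, lower, upper), where lower/upper bound a central 95% interval.
    """
    total = sum(topic_counts)
    weights = [n / total for n in topic_counts]  # pi_hat_c = n_c / N
    # One aggregate value Q_bar = sum_c pi_hat_c * q_c per posterior draw.
    q_bar = sorted(sum(w * q for w, q in zip(weights, draw)) for draw in q_draws)
    m = len(q_bar)
    mean = sum(q_bar) / m
    lower = q_bar[int(0.025 * (m - 1))]
    upper = q_bar[int(0.975 * (m - 1))]
    return mean, lower, upper


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [5000, 3000, 2000]                  # hypothetical n_c for 3 topics
    beta_params = [(80, 20), (50, 50), (30, 70)]  # stand-in posteriors for q_c
    draws = [[rng.betavariate(a, b) for a, b in beta_params]
             for _ in range(4000)]
    mean, lower, upper = aggregate_quality(counts, draws)
    print(f"Q_bar = {mean:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Computing $\bar Q$ once per posterior draw, rather than combining per-topic point estimates, is what lets the interval on the aggregate carry the uncertainty of each $q_c$.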

Problem

Research questions and friction points this paper is trying to address.

selection bias

sparse user feedback

LLM quality estimation

sentiment stratification

topic stratification

Innovation

Methods, ideas, or system contributions that make the work stand out.

selection bias correction

hierarchical Bayesian modeling

topic-aware quality estimation