AI Summary
This study addresses severe selection bias in user feedback during large language model deployment: feedback comes only from a non-random, polarized subset of users, so naive averaging can deviate from true system quality by 40-50 percentage points. To tackle this, the authors propose the first three-tier Bayesian framework that jointly models topics and sentiment without requiring individual ground-truth labels, formulating the problem as an identifiable Bayesian inference task. By incorporating a feedback-channel prior, the approach resolves parameter non-identifiability and enables online drift detection and recalibration. The end-to-end multi-agent pipeline integrates UMAP + HDBSCAN clustering, a two-stage Beta-Binomial bias model, and topic-popularity reweighting. Evaluated on UltraFeedback, the method achieves estimation errors of only 4-13 percentage points as the bias ratio ranges from 1:1 to 30:1, with 95% credible intervals consistently covering the true values, substantially outperforming baseline approaches.
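To make the size of the bias and the non-identifiability concrete, here is a minimal sketch in Python. It is not the authors' code. It assumes a single topic, that each satisfied user leaves a thumbs-up with probability `s_pos`, that each dissatisfied user leaves a thumbs-down with probability `s_neg`, and that the "bias ratio" means `s_neg / s_pos`. The function names and the example counts are hypothetical.

```python
def naive_positive_share(q: float, s_pos: float, s_neg: float) -> float:
    """Expected share of thumbs-up among all feedback actually received.

    q      : true fraction of satisfied users (latent quality)
    s_pos  : assumed probability that a satisfied user leaves feedback
    s_neg  : assumed probability that a dissatisfied user leaves feedback
    """
    pos = q * s_pos            # expected thumbs-up per interaction
    neg = (1.0 - q) * s_neg    # expected thumbs-down per interaction
    return pos / (pos + neg)


def selection_rates_for(q: float, n: int, n_pos_fb: int, n_neg_fb: int):
    """Selection rates that reproduce the observed counts exactly for a guessed q."""
    return n_pos_fb / (n * q), n_neg_fb / (n * (1.0 - q))


# Quality level reported for the UltraFeedback validation set.
Q_STAR = 0.6249

# Assumed 10:1 negative-to-positive selection ratio (illustrative values).
naive = naive_positive_share(Q_STAR, s_pos=0.01, s_neg=0.10)
gap_pp = 100 * (Q_STAR - naive)
print(f"naive estimate = {naive:.3f}, gap = {gap_pp:.1f} pp")

# Degeneracy: hypothetical counts for one topic with n interactions.
n, n_pos_fb, n_neg_fb = 1000, 6, 38
candidates = {
    q: selection_rates_for(q, n, n_pos_fb, n_neg_fb) for q in (0.3, 0.5, 0.7)
}
for q, (s_pos, s_neg) in candidates.items():
    # Every guessed q comes with rates that match the counts perfectly.
    print(f"q = {q:.1f} -> s_pos = {s_pos:.4f}, s_neg = {s_neg:.4f}")
```

Under these assumed rates the naive share is about 0.14, roughly 48 points below the true 0.6249. The second loop shows why the counts alone cannot pin down quality: any guessed `q` can be paired with selection rates that fit the same feedback counts exactly, so extra information about the feedback channel is needed to choose among them.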
Abstract
[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment-stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat\pi_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hat\pi_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
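The synthesis step described above, weighting per-topic quality by topic prevalence $n_c/N$ and summarizing the resulting aggregate posterior, can be sketched in a few lines of Python. This is an illustrative sketch only. It assumes posterior draws of $q_c$ are already available, and it uses independent Beta draws as a stand-in for NUTS output. The function name `aggregate_quality`, the three-topic counts, and the Beta parameters are all hypothetical.

```python
import random


def aggregate_quality(q_draws, n_c, level=0.95):
    """Prevalence-weighted aggregate quality with an equal-tailed credible interval.

    q_draws : list of posterior draws, each a list with one q_c per topic
    n_c     : number of interactions per topic (all interactions, not just
              those with feedback)
    """
    total = sum(n_c)
    pi_hat = [n / total for n in n_c]                 # topic prevalence n_c / N
    # One aggregate value per posterior draw: sum_c pi_hat_c * q_c.
    q_bar = sorted(sum(p * q for p, q in zip(pi_hat, draw)) for draw in q_draws)
    tail = (1.0 - level) / 2.0
    # Simple empirical quantiles taken from the sorted draws.
    lo = q_bar[int(tail * (len(q_bar) - 1))]
    hi = q_bar[int((1.0 - tail) * (len(q_bar) - 1))]
    mean = sum(q_bar) / len(q_bar)
    return mean, (lo, hi)


# Stand-in for posterior samples: independent Beta draws per topic (assumption).
rng = random.Random(0)
n_c = [5000, 3000, 2000]                      # hypothetical topic sizes
beta_params = [(70, 30), (50, 50), (20, 80)]  # hypothetical per-topic posteriors
q_draws = [
    [rng.betavariate(a, b) for a, b in beta_params] for _ in range(4000)
]

q_bar_mean, (lo, hi) = aggregate_quality(q_draws, n_c)
print(f"Q_bar = {q_bar_mean:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Computing the weighted sum once per posterior draw, rather than weighting the per-topic posterior means, carries the uncertainty in each $q_c$ through to the interval on the aggregate.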