🤖 AI Summary
Large language models (LLMs) frequently produce unreliable outputs, and existing uncertainty estimation methods lack statistical guarantees for reliably identifying incorrect answers. To address this, we propose a false discovery rate (FDR) control framework that formulates selective prediction as a decision problem with linear expectation constraints, deriving a sufficient condition for finite-sample FDR control via calibration sets. We further design an uncertainty-aware dual-model routing mechanism that intelligently allocates tasks across models while maintaining unified FDR guarantees. Our method performs offline calibration under the exchangeability assumption, balancing theoretical rigor with practical applicability. Experiments on multiple question-answering benchmarks demonstrate significant improvements in both FDR control accuracy and effective sample retention rate. Moreover, under strict FDR control, our approach substantially increases the number of correctly accepted answers, thereby enhancing reliability without sacrificing coverage.
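The calibration step described above can be illustrated with a minimal sketch. This is not the paper's exact procedure: the function name `calibrate_threshold`, the add-one finite-sample correction, and the accept-when-uncertainty-is-low convention are all assumptions made for illustration. The idea is to scan candidate thresholds over a held-out calibration set and keep the most permissive one whose estimated FDR stays below the target risk level.

```python
import numpy as np

def calibrate_threshold(unc, err, alpha):
    """Sketch of offline FDR calibration (illustrative, not the paper's exact rule).

    unc   : array of uncertainty scores on calibration samples
    err   : array of 0/1 error indicators (1 = model's answer was wrong)
    alpha : target FDR level

    A test sample is accepted when its uncertainty <= tau. We return the
    largest tau (maximizing coverage) whose estimated FDR on the
    calibration set does not exceed alpha, or None if no tau qualifies.
    """
    unc = np.asarray(unc, dtype=float)
    err = np.asarray(err, dtype=int)
    best = None
    for tau in np.sort(unc):
        accept = unc <= tau
        # add-one correction: a common finite-sample hedge in
        # conformal-style calibration (assumed here, not taken from the paper)
        fdr_hat = (1 + err[accept].sum()) / (accept.sum() + 1)
        if fdr_hat <= alpha:
            best = tau  # larger tau accepted later in the sorted scan
    return best
```

For example, with calibration uncertainties `[0.1, 0.2, 0.3, 0.9, 0.95]`, errors `[0, 0, 0, 1, 1]`, and `alpha = 0.3`, the scan settles on `tau = 0.3`: accepting the three low-uncertainty (all correct) samples keeps the corrected FDR estimate at 1/4, while any larger threshold admits an error and pushes the estimate above the target.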
📝 Abstract
Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. We then establish a finite-sample sufficient condition, relying only on a held-out set of exchangeable calibration samples, for computing an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model's uncertainty exceeds its calibrated threshold, we delegate the prompt to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than either individual model.
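The two-model routing rule described in the abstract can be sketched as a short decision function. Everything here is a hypothetical interface, not the paper's code: each model is assumed to return an `(answer, uncertainty)` pair, and `tau_small` / `tau_large` stand for the per-model thresholds produced by the calibration step. Answer locally when the small model is confident enough, escalate when it is not, and abstain if even the stronger model exceeds its threshold.

```python
def route(prompt, small_model, large_model, tau_small, tau_large):
    """Sketch of uncertainty-aware two-model routing (illustrative).

    small_model, large_model : callables returning (answer, uncertainty)
                               for a prompt -- a hypothetical interface
    tau_small, tau_large     : calibrated per-model acceptance thresholds

    Returns the accepted answer, or None to abstain when neither model
    can answer within its FDR-calibrated uncertainty budget.
    """
    answer, uncertainty = small_model(prompt)
    if uncertainty <= tau_small:
        return answer          # small model is confident: answer cheaply
    answer, uncertainty = large_model(prompt)
    if uncertainty <= tau_large:
        return answer          # escalate: stronger model takes over
    return None                # abstain: no FDR-safe answer available

# Toy usage with stub models standing in for real LLM calls:
small = lambda p: ("small-answer", 0.9)   # high uncertainty -> escalate
large = lambda p: ("large-answer", 0.1)   # low uncertainty  -> accept
```

Because each model only answers below its own calibrated threshold, the accepted set is the union of two FDR-controlled selections, which is how a unified guarantee across the pair can be maintained.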