🤖 AI Summary
Existing hybrid inference methods support only mean or single-quantile estimation, which falls short of domain-specific needs for fine-grained distributional characterization such as tail risk and interquartile range. This paper introduces QuEst, the first framework enabling hybrid inference over multiple quantiles (including extreme quantiles) and multivariate metrics. QuEst integrates a small number of high-fidelity observations with abundant model predictions, jointly leveraging quantile regression and variance-reduction optimization to deliver point estimates together with statistically rigorous confidence intervals. Compared to conventional approaches, QuEst substantially improves estimation accuracy and statistical validity. Empirical validation across economic modeling, opinion polling, and self-assessment of large language models confirms its effectiveness. QuEst establishes a general, scalable paradigm for reliable inference on quantile-based distributional metrics.
📝 Abstract
As machine learning models grow increasingly competent, their predictions can supplement scarce or expensive data in various important domains. In support of this paradigm, algorithms have emerged to combine a small amount of high-fidelity observed data with a much larger set of imputed model outputs to estimate some quantity of interest. Yet current hybrid-inference tools target only means or single quantiles, which limits their applicability in many critical domains and use cases. We present QuEst, a principled framework to merge observed and imputed data to deliver point estimates and rigorous confidence intervals for a wide family of quantile-based distributional measures. QuEst covers a range of measures, from tail risk (CVaR) to population segments such as quartiles, that are central to fields such as economics, sociology, education, and medicine. We extend QuEst to multidimensional metrics, and introduce an additional optimization technique to further reduce variance in this and other hybrid estimators. We demonstrate the utility of our framework through experiments in economic modeling, opinion polling, and language model auto-evaluation.
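To make the idea of merging observed and imputed data concrete, the following is a minimal sketch of a generic hybrid estimate of CVaR. It is not QuEst's actual estimator, which the abstract does not spell out. It assumes a simple correction scheme: estimate CVaR on the small observed sample, then adjust it by the difference between CVaR on the large imputed sample and CVaR on the model's predictions for the observed points. The function names `cvar` and `hybrid_cvar` and the weight `lam` are hypothetical and chosen for illustration only; confidence intervals are omitted.

```python
import numpy as np


def cvar(values, alpha=0.9):
    """Empirical upper-tail CVaR: mean of the values at or above the alpha-quantile."""
    values = np.asarray(values, dtype=float)
    threshold = np.quantile(values, alpha)
    return values[values >= threshold].mean()


def hybrid_cvar(y_observed, pred_observed, pred_unlabeled, alpha=0.9, lam=1.0):
    """Illustrative hybrid CVaR estimate (an assumption, not the paper's method).

    y_observed:     high-fidelity outcomes for the small observed sample
    pred_observed:  model predictions for those same observed points
    pred_unlabeled: model predictions for the large imputed sample
    lam:            hypothetical weight on the imputed data
                    (0 ignores the predictions, 1 uses them fully)
    """
    classical = cvar(y_observed, alpha)
    # How much the imputed sample shifts the estimate, measured on predictions only.
    adjustment = cvar(pred_unlabeled, alpha) - cvar(pred_observed, alpha)
    return classical + lam * adjustment


# Synthetic usage: predictions are noisy, slightly biased versions of the truth.
rng = np.random.default_rng(0)
truth_small = rng.normal(size=200)
pred_small = truth_small + rng.normal(scale=0.3, size=200) + 0.1
pred_large = rng.normal(size=20_000) + rng.normal(scale=0.3, size=20_000) + 0.1

print("observed only:", cvar(truth_small))
print("hybrid:       ", hybrid_cvar(truth_small, pred_small, pred_large))
```

With `lam=0` the sketch reduces to the estimate from observed data alone. With `lam=1` and perfect predictions it reduces to the estimate from the imputed sample. Intermediate values trade off the two sources.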