🤖 AI Summary
Quantifying the distributional discrepancy between simulators of complex, machine-learning-based systems and their real-world counterparts remains challenging because tractable, model-free metrics are lacking.
Method: We propose a model-agnostic method that treats the simulator as a black box and estimates the quantile function of the discrepancy between the simulated and ground-truth output distributions, without modeling assumptions on the simulator's internals. The resulting quantile curve supports confidence interval construction for unseen scenarios and risk-aware summaries of the discrepancy (e.g., VaR, CVaR).
Contribution/Results: The method focuses on output uncertainty and applies across Bernoulli, multinomial, and continuous vector-valued outputs. Applied to the WorldValueBench dataset, it quantifies simulation fidelity for four large language models, supporting cross-model performance comparison and risk-sensitive analysis.
📝 Abstract
Simulation of complex systems originated in manufacturing and queuing applications. It is now widely used for large-scale, ML-based systems in research, education, and consumer surveys. However, characterizing the discrepancy between simulators and ground truth remains challenging for increasingly complex, machine-learning-based systems. We propose a computationally tractable method to estimate the quantile function of the discrepancy between the simulated and ground-truth outcome distributions. Our approach focuses on output uncertainty and treats the simulator as a black box, imposing no modeling assumptions on its internals, and hence applies broadly across many parametric families, from Bernoulli and multinomial models to continuous, vector-valued settings. The resulting quantile curve supports confidence interval construction for unseen scenarios, risk-aware summaries of sim-to-real discrepancy (e.g., VaR/CVaR), and comparison of simulators' performance. We demonstrate our methodology in an application that assesses LLM simulation fidelity on the WorldValueBench dataset across four LLMs.
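To make the role of the quantile curve concrete, the following is a minimal sketch and not the paper's estimator. It assumes a Bernoulli setting in which each scenario has a simulated and a ground-truth success probability, takes their absolute difference as the discrepancy, and uses plain empirical quantiles. The synthetic data, the function names `quantile_curve` and `var_cvar`, and the choice of absolute difference are all illustrative assumptions; the sketch also ignores the sampling noise that arises when the probabilities themselves must be estimated from finite data.

```python
import numpy as np


def quantile_curve(discrepancies, levels):
    """Empirical quantile function of per-scenario discrepancies (hypothetical helper)."""
    d = np.asarray(discrepancies, dtype=float)
    return np.quantile(d, levels)


def var_cvar(discrepancies, alpha=0.9):
    """Empirical VaR (the alpha-quantile) and CVaR (mean of the tail at or above VaR)."""
    d = np.asarray(discrepancies, dtype=float)
    var = np.quantile(d, alpha)
    tail = d[d >= var]  # never empty: the maximum is always >= any quantile
    return float(var), float(tail.mean())


# Synthetic illustration only: these are NOT results from the paper.
rng = np.random.default_rng(0)
n_scenarios = 200
p_real = rng.uniform(0.2, 0.8, n_scenarios)                        # assumed ground truth
p_sim = np.clip(p_real + rng.normal(0, 0.05, n_scenarios), 0, 1)   # assumed simulator output
disc = np.abs(p_sim - p_real)                                      # assumed discrepancy measure

levels = np.linspace(0.05, 0.95, 19)
curve = quantile_curve(disc, levels)
var90, cvar90 = var_cvar(disc, alpha=0.9)

print("quantile curve:", np.round(curve, 3))
print("VaR(0.9):", round(var90, 3), "CVaR(0.9):", round(cvar90, 3))
```

Under the additional assumption that an unseen scenario is exchangeable with the observed ones, the value of the curve at level 0.9 can be read as an approximate bound that the discrepancy stays below in about 90% of scenarios. Comparing two simulators then amounts to comparing their curves level by level.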