🤖 AI Summary
This work addresses the lack of reliable statistical evaluation methods for generative models, which hinders both the assessment of their generalization performance and the estimation of evaluation metrics from finite samples. The authors propose a theoretical framework that systematically analyzes the conditions under which common evaluation metrics are statistically estimable, distinguishing between test-class-based metrics and divergence-based metrics in finite-sample settings. Using tools from the theory of integral probability metrics (IPMs), Rényi divergences, and the fat-shattering dimension, they rigorously establish that IPMs induced by bounded test classes can be estimated from finite samples up to multiplicative and additive error, with arbitrarily accurate estimation when the test class has finite fat-shattering dimension, whereas KL and Rényi divergences, whose values can be critically determined by rare events, cannot be reliably estimated. This study provides a foundational theoretical basis and practical guidance for evaluating generative models.
📝 Abstract
Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d. test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples.
In this work, we introduce a theoretical framework for evaluating generative models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, exemplified by integral probability metrics (IPMs), and divergence-based metrics, exemplified by Rényi divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, Rényi and KL divergences are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.
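The contrast between the two claims can be illustrated numerically. The sketch below is my own toy construction, not the paper's proofs: the finite support, the hand-picked bounded test class `F`, and the sample size are all arbitrary assumptions. It shows a plug-in IPM estimate converging to the true value (bounded test functions give Hoeffding-rate concentration), while KL jumps from a finite value to infinity under a rare-event perturbation of the model (mass 0 vs. 1e-9 on one symbol) that no realistic sample size can detect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: distributions over the finite support {0, ..., 9}.
# p is the "ground truth"; q and q2 are two hypothetical models that
# differ only on a rare event: symbol 9 gets mass 0 under q but 1e-9
# under q2, so finite samples cannot tell them apart.
p = np.arange(1.0, 11.0); p /= p.sum()
q = np.array([1 / 9] * 9 + [0.0])
q2 = np.array([(1 - 1e-9) / 9] * 9 + [1e-9])

def kl(p, q):
    """KL(P || Q), with the convention p * log(p / 0) = inf for p > 0."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A small bounded test class F (5 functions with values in [-1, 1]);
# the induced IPM is  sup_{f in F} |E_P f - E_Q f|.
F = rng.uniform(-1.0, 1.0, size=(5, 10))

def ipm(F, p, q):
    return float(np.max(np.abs(F @ (p - q))))

# Plug-in IPM estimate: replace p and q by empirical frequencies.
# Each |E f - mean f| concentrates at rate ~ 1/sqrt(n) since f is bounded.
n = 20000
phat = np.bincount(rng.choice(10, size=n, p=p), minlength=10) / n
qhat = np.bincount(rng.choice(10, size=n, p=q), minlength=10) / n

ipm_pq, ipm_pq2 = ipm(F, p, q), ipm(F, p, q2)
ipm_hat = ipm(F, phat, qhat)
kl_pq, kl_pq2 = kl(p, q), kl(p, q2)

print(f"IPM(P,Q) = {ipm_pq:.4f}, plug-in estimate = {ipm_hat:.4f}")
print(f"IPM shift under the rare-event change: {abs(ipm_pq - ipm_pq2):.2e}")
print(f"KL(P||Q) = {kl_pq}, KL(P||Q2) = {kl_pq2:.3f}")
```

The design point: the IPM is Lipschitz in total variation (bounded test functions), so the statistically invisible change from `q` to `q2` moves it by at most 2e-9, whereas the same change moves KL from roughly 3.4 to infinity.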