Exposing Diversity Bias in Deep Generative Models: Statistical Origins and Correction of Diversity Error

📅 2026-02-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines a systematic bias in deep generative models: generated samples tend to be less diverse than the data the models are trained to reproduce. Using the reference-free, entropy-based Vendi and RKE scores, the authors compare generated samples against test samples from the target distribution on multiple benchmark datasets and find that the test data consistently scores higher. They trace the bias to the finite-sample behavior of these scores, whose expected values increase with sample size, so a finite training set understates the diversity of the true distribution and a generator fit to the empirical distribution inherits that shortfall. The authors then discuss diversity-aware regularization and guidance strategies based on Vendi and RKE, and report empirical evidence suggesting these can improve diversity.

📝 Abstract

Deep generative models have achieved great success in producing high-quality samples, making them a central tool across machine learning applications. Beyond sample quality, an important yet less systematically studied question is whether trained generative models faithfully capture the diversity of the underlying data distribution. In this work, we address this question by directly comparing the diversity of samples generated by state-of-the-art models with that of test samples drawn from the target data distribution, using recently proposed reference-free entropy-based diversity scores, Vendi and RKE. Across multiple benchmark datasets, we find that test data consistently attains substantially higher Vendi and RKE diversity scores than the generated samples, suggesting a systematic downward diversity bias in modern generative models. To understand the origin of this bias, we analyze the finite-sample behavior of entropy-based diversity scores and show that their expected values increase with sample size, implying that diversity estimated from finite training sets could inherently underestimate the diversity of the true distribution. As a result, optimizing the generators to minimize divergence to empirical data distributions would induce a loss of diversity. Finally, we discuss potential diversity-aware regularization and guidance strategies based on Vendi and RKE as principled directions for mitigating this bias, and provide empirical evidence suggesting their potential to improve the results.
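
The finite-sample effect described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions of the two scores on a kernel matrix K with unit diagonal: Vendi is the exponential of the Shannon entropy of the eigenvalues of K/n, and RKE is the inverse squared Frobenius norm of K/n (the exponential of the order-2 Rényi entropy). The Gaussian kernel, its `bandwidth`, the toy standard-normal data, and the helper names `gaussian_kernel` and `scores_by_sample_size` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def gaussian_kernel(x, bandwidth=1.0):
    """Gaussian kernel matrix with k(x, x) = 1 (kernel choice is an assumption)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def vendi_score(K):
    """exp(Shannon entropy) of the eigenvalues of K / n."""
    n = K.shape[0]
    eig = np.linalg.eigvalsh(K / n)
    eig = eig[eig > 1e-12]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(eig * np.log(eig))))


def rke_score(K):
    """exp(order-2 Renyi entropy) = 1 / ||K / n||_F^2."""
    n = K.shape[0]
    return float(1.0 / np.sum((K / n) ** 2))


def scores_by_sample_size(sizes=(50, 200, 400), dim=8, seed=0):
    """Score independent draws of increasing size from one fixed distribution."""
    rng = np.random.default_rng(seed)
    results = []
    for n in sizes:
        x = rng.standard_normal((n, dim))  # toy data, purely illustrative
        K = gaussian_kernel(x)
        results.append((n, vendi_score(K), rke_score(K)))
    return results


if __name__ == "__main__":
    for n, vendi, rke in scores_by_sample_size():
        print(f"n={n:4d}  Vendi={vendi:8.2f}  RKE={rke:8.2f}")
```

Although every draw comes from the same distribution, both scores grow with n in this toy setting, which matches the abstract's point that diversity measured on a finite sample sits below that of the underlying distribution.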

Problem

Research questions and friction points this paper is trying to address.

diversity bias

deep generative models

sample diversity

entropy-based diversity scores

distribution fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

diversity bias

generative models

entropy-based diversity