🤖 AI Summary
Evaluating synthetic tabular data quality remains challenging because existing evaluation metrics often conflict and offer no explanation for their scores. To address this, we propose an explainable AI (XAI)-based diagnostic framework: first, a binary classifier is trained to distinguish real from synthetic samples; then, permutation feature importance, partial dependence plots, SHAP values, and counterfactual explanations are systematically integrated to localize the root causes of distributional discrepancies, such as anomalous variable dependencies or missingness pattern biases, revealing structural flaws in the generation process. This is the first work to systematically apply XAI techniques to synthetic data quality assessment. Experiments on two benchmark datasets demonstrate that our framework uncovers critical generative defects, e.g., spurious correlations and biased missingness, that are missed by conventional metrics like Jensen–Shannon divergence and machine learning utility. It delivers actionable, attribution-aware diagnostics, thereby enhancing transparency and accelerating iterative refinement of synthetic data generators.
📝 Abstract
Evaluating synthetic tabular data is challenging because it can differ from the real data in many ways. Numerous metrics of synthetic data quality exist, ranging from statistical distances to predictive performance, and they often yield conflicting results. Moreover, they fail to explain or pinpoint the specific weaknesses in the synthetic data. To address this, we apply explainable AI (XAI) techniques to a binary detection classifier trained to distinguish real from synthetic data. While the classifier identifies distributional differences, XAI concepts such as feature importance and feature effects, analyzed through methods like permutation feature importance, partial dependence plots, Shapley values, and counterfactual explanations, reveal why the synthetic data are distinguishable, highlighting inconsistencies, unrealistic dependencies, or biased missingness patterns. This interpretability increases transparency in synthetic data evaluation and provides deeper insights beyond conventional metrics, helping diagnose and improve synthetic data quality. We apply our approach to two tabular datasets and generative models, showing that it uncovers issues overlooked by standard evaluation techniques.
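The core pipeline described in the abstract, training a detection classifier and then interrogating it with XAI methods, can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it uses toy stand-in data (the real data carries a dependency between the first two features that the "synthetic" data lacks, mimicking an unrealistic-dependency flaw), a scikit-learn random forest as the detection classifier, and permutation feature importance as one of the XAI methods named in the text; the SHAP and counterfactual steps are omitted.

```python
# Sketch of the detection-classifier + XAI diagnostic loop (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Stand-in "real" data: feature 1 depends on feature 0.
real = rng.normal(size=(n, 3))
real[:, 1] = 0.8 * real[:, 0] + 0.2 * rng.normal(size=n)

# Stand-in "synthetic" data: same marginals, but the dependency is missing --
# the kind of structural flaw the framework is meant to surface.
synth = rng.normal(size=(n, 3))

X = np.vstack([real, synth])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = real, 1 = synthetic

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# If the classifier beats chance, the synthetic data is distinguishable;
# permutation importance then points at which features are responsible.
print("detection accuracy:", clf.score(X_te, y_te))
pfi = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print("mean importance per feature:", pfi.importances_mean)
```

In this toy setup, features 0 and 1 should dominate the importance ranking while the independent feature 2 stays near zero, localizing the broken dependency; partial dependence plots or SHAP values could then characterize how the dependency was distorted.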