🤖 AI Summary
Current evaluation of synthetic tabular data lacks a unified framework, resulting in heterogeneous metrics and fragmented practices. To address this, we propose a three-dimensional fidelity assessment paradigm integrating statistical distribution alignment, variable dependency preservation, and graph-structured representation learning. We implement this paradigm as the Synthetic Data Blueprint (SDB), a modular Python library supporting automated data-type inference, cross-domain benchmarking, and end-to-end interactive visualization report generation. SDB integrates principled methods including the Wasserstein distance for marginal distribution comparison, the Hilbert–Schmidt Independence Criterion (HSIC) for dependency quantification, and GNN-based embedding similarity for relational and graph-structural fidelity. Built with Plotly and Seaborn, it enables interactive exploratory analysis. Empirical validation across three heterogeneous domains (medical diagnosis, socioeconomic modeling, and cybersecurity) demonstrates notable improvements in evaluation consistency, interpretability, and robustness, particularly for high-cardinality categorical variables and high-dimensional time-series signals.
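To make the marginal-comparison metric concrete, the sketch below computes the per-column 1D Wasserstein distance between real and synthetic numeric features. This is a minimal illustration of the metric the summary cites, not SDB's actual API; the function name `marginal_fidelity` and the dict-of-arrays inputs are hypothetical stand-ins for dataframe columns.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_fidelity(real, synth):
    """Per-column 1D Wasserstein distance between real and synthetic
    marginals; 0 means identical empirical distributions, larger values
    mean greater divergence. (Illustrative helper, not SDB's API.)"""
    return {col: wasserstein_distance(real[col], synth[col])
            for col in real if col in synth}

# Toy example: a synthetic "age" column with slightly shifted moments.
rng = np.random.default_rng(0)
real = {"age": rng.normal(40, 10, 1000)}
synth = {"age": rng.normal(42, 12, 1000)}
scores = marginal_fidelity(real, synth)
```

Because the distance is computed column by column, it captures marginal alignment only; dependency structure between columns needs a separate measure such as HSIC.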
📝 Abstract
In the rapidly evolving era of Artificial Intelligence (AI), synthetic data are widely used to accelerate innovation while preserving privacy and enabling broader data accessibility. However, the evaluation of synthetic data remains fragmented across heterogeneous metrics, ad-hoc scripts, and incomplete reporting practices. To address this gap, we introduce the Synthetic Data Blueprint (SDB), a modular Python library for quantitatively and visually assessing the fidelity of synthetic tabular data. SDB supports: (i) automated feature-type detection, (ii) distributional and dependency-level fidelity metrics, (iii) graph- and embedding-based structure preservation scores, and (iv) a rich suite of data visualization schemas. To demonstrate the breadth, robustness, and domain-agnostic applicability of SDB, we evaluated the framework across three real-world use cases that differ substantially in scale, feature composition, statistical complexity, and downstream analytical requirements: (i) healthcare diagnostics, (ii) socioeconomic and financial modelling, and (iii) cybersecurity and network traffic analysis. These use cases show how SDB addresses diverse data fidelity assessment challenges, from mixed-type clinical variables to high-cardinality categorical attributes and high-dimensional telemetry signals, while offering consistent, transparent, and reproducible benchmarking across heterogeneous domains.
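The dependency-level metrics in point (ii) can be illustrated with the Hilbert–Schmidt Independence Criterion. The sketch below is a minimal biased empirical HSIC estimator with Gaussian kernels, written here only to show what the criterion measures; the function name, the fixed bandwidth `sigma`, and the toy data are assumptions, not SDB's implementation.

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels:
    HSIC = tr(K H L H) / (n - 1)^2, where K and L are the kernel Gram
    matrices of x and y and H is the centering matrix. Values near zero
    indicate independence; larger values indicate dependence.
    (Illustrative sketch, not SDB's implementation.)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    def gram(v):
        # Pairwise squared distances -> Gaussian kernel Gram matrix.
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))
    K, L = gram(x), gram(y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Dependent pair (y is a noisy copy of x) vs. an independent pair.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
dep = hsic(x, x + 0.1 * rng.normal(size=200))
ind = hsic(x, rng.normal(size=200))
```

In a fidelity-evaluation setting, pairwise HSIC scores computed on the real and synthetic tables can be compared to check whether inter-variable dependencies are preserved, not just the individual marginals.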