🤖 AI Summary
Current evaluation of synthetic tabular data lacks a unified framework, resulting in heterogeneous metrics and fragmented practices. To address this, we propose a three-dimensional fidelity assessment paradigm integrating statistical distribution alignment, variable dependency preservation, and graph-structured representation learning. We implement this paradigm as the Synthetic Data Blueprint (SDB), a modular Python library supporting automated data-type inference, cross-domain benchmarking, and end-to-end interactive visualization report generation. SDB integrates principled methods including the Wasserstein distance for marginal distribution comparison, the Hilbert–Schmidt Independence Criterion (HSIC) for dependency quantification, and GNN-based embedding similarity for relational and graph-structural fidelity. Built with Plotly and Seaborn, it enables interactive exploratory analysis. Empirical validation across three heterogeneous domains (medical diagnosis, socioeconomic modeling, and cybersecurity) demonstrates notable improvements in evaluation consistency, interpretability, and robustness, particularly for high-cardinality categorical variables and high-dimensional time-series signals.
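To make the marginal-comparison metric concrete, the sketch below computes the per-column 1D Wasserstein distance between real and synthetic numeric features. This is a minimal illustration of the metric the summary cites, not SDB's actual API; the function name `marginal_fidelity` and the dict-of-arrays inputs are hypothetical stand-ins for dataframe columns.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_fidelity(real, synth):
    """Per-column 1D Wasserstein distance between real and synthetic
    marginals; 0 means identical empirical distributions, larger values
    mean greater divergence. (Illustrative helper, not SDB's API.)"""
    return {col: wasserstein_distance(real[col], synth[col])
            for col in real if col in synth}

# Toy example: a synthetic "age" column with slightly shifted moments.
rng = np.random.default_rng(0)
real = {"age": rng.normal(40, 10, 1000)}
synth = {"age": rng.normal(42, 12, 1000)}
scores = marginal_fidelity(real, synth)
```

Because the distance is computed column by column, it captures marginal alignment only; dependency structure between columns needs a separate measure such as HSIC.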
📝 Abstract
In the rapidly evolving era of Artificial Intelligence (AI), synthetic data are widely used to accelerate innovation while preserving privacy and enabling broader data accessibility. However, the evaluation of synthetic data remains fragmented across heterogeneous metrics, ad-hoc scripts, and incomplete reporting practices. To address this gap, we introduce the Synthetic Data Blueprint (SDB), a modular Python library for quantitatively and visually assessing the fidelity of synthetic tabular data. SDB supports: (i) automated feature-type detection, (ii) distributional and dependency-level fidelity metrics, (iii) graph- and embedding-based structure preservation scores, and (iv) a rich suite of data visualization schemas. To demonstrate the breadth, robustness, and domain-agnostic applicability of SDB, we evaluated the framework across three real-world use cases that differ substantially in scale, feature composition, statistical complexity, and downstream analytical requirements: (i) healthcare diagnostics, (ii) socioeconomic and financial modelling, and (iii) cybersecurity and network traffic analysis. These use cases show how SDB addresses diverse data fidelity assessment challenges, from mixed-type clinical variables to high-cardinality categorical attributes and high-dimensional telemetry signals, while offering consistent, transparent, and reproducible benchmarking across heterogeneous domains.
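The dependency-level metrics in point (ii) can be illustrated with the Hilbert–Schmidt Independence Criterion. The sketch below is a minimal biased empirical HSIC estimator with Gaussian kernels, written here only to show what the criterion measures; the function name, the fixed bandwidth `sigma`, and the toy data are assumptions, not SDB's implementation.

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels:
    HSIC = tr(K H L H) / (n - 1)^2, where K and L are the kernel Gram
    matrices of x and y and H is the centering matrix. Values near zero
    indicate independence; larger values indicate dependence.
    (Illustrative sketch, not SDB's implementation.)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    def gram(v):
        # Pairwise squared distances -> Gaussian kernel Gram matrix.
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))
    K, L = gram(x), gram(y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Dependent pair (y is a noisy copy of x) vs. an independent pair.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
dep = hsic(x, x + 0.1 * rng.normal(size=200))
ind = hsic(x, rng.normal(size=200))
```

In a fidelity-evaluation setting, pairwise HSIC scores computed on the real and synthetic tables can be compared to check whether inter-variable dependencies are preserved, not just the individual marginals.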