Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Synthesizing tabular medical data while preserving privacy, utility, and fairness at the same time remains difficult. This work proposes a goal-driven, unified workflow in which a local large language model interprets the user's natural-language instructions and orchestrates both the generation of synthetic data, using existing synthesizers such as CTGAN, TVAE, and GaussianCopula, and its evaluation along multiple dimensions. An interactive agent takes the place of manual parameter tuning, so users state their goals rather than adjust knobs while keeping control over the generation and evaluation process. In a demo on an open-source schizophrenia dataset, the three synthesizers perform comparably on utility and fairness metrics.

📝 Abstract

Synthetic data is widely used in healthcare to create datasets that resemble the original data without carrying the same privacy concerns. Generating and evaluating synthetic data across privacy, utility and fairness is crucial for making high-quality data available for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by combining existing synthetic data tools, large language models and state-of-the-art evaluation metrics. The tool creates a unified workflow for data generation, validation and evaluation. Users control the training size, the number of training epochs and the number of synthetic rows to sample. Instead of exposing knobs for tuning synthetic data, an interactive agent lets users specify their generation goals; the tool then orchestrates the workflow with existing tools and performs the requisite evaluation. For the demo, we use an open-source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE and GaussianCopula perform comparably across fairness and utility metrics. The workflow gives users flexibility and control over the data generation and evaluation process.
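To make the evaluation step more concrete, the sketch below compares candidate synthetic tables against a real table on one utility measure and one fairness measure. It is only an illustration under stated assumptions, not the Memisis implementation: the abstract does not name its metrics, so total variation distance and a demographic parity gap are stand-ins, and the function names, the column names `gender` and `diagnosis`, and the toy rows are all hypothetical. The synthetic tables are hand-written placeholders, not output from CTGAN, TVAE or GaussianCopula, and privacy evaluation is not covered.

```python
from collections import Counter


def total_variation_distance(real_col, synth_col):
    """Utility stand-in: distance between category frequencies (0 = identical)."""
    real_freq = Counter(real_col)
    synth_freq = Counter(synth_col)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_col) - synth_freq[c] / len(synth_col))
        for c in categories
    )


def demographic_parity_gap(rows, group_key, outcome_key):
    """Fairness stand-in: largest gap in positive-outcome rate between groups."""
    totals, positives = Counter(), Counter()
    for row in rows:
        totals[row[group_key]] += 1
        positives[row[group_key]] += int(row[outcome_key])
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def evaluate(real_rows, synthetic_by_name, group_key, outcome_key):
    """Score every candidate synthetic table against the real table."""
    real_outcomes = [r[outcome_key] for r in real_rows]
    report = {}
    for name, synth_rows in synthetic_by_name.items():
        synth_outcomes = [r[outcome_key] for r in synth_rows]
        report[name] = {
            "utility_tvd": total_variation_distance(real_outcomes, synth_outcomes),
            "parity_gap": demographic_parity_gap(synth_rows, group_key, outcome_key),
        }
    return report


# Hypothetical toy data: one protected attribute and one binary outcome.
real = [
    {"gender": "F", "diagnosis": 1},
    {"gender": "F", "diagnosis": 0},
    {"gender": "M", "diagnosis": 1},
    {"gender": "M", "diagnosis": 0},
]
# Hand-written placeholder tables standing in for synthesizer output.
candidates = {
    "synth_a": [dict(row) for row in real],
    "synth_b": [
        {"gender": "F", "diagnosis": 1},
        {"gender": "F", "diagnosis": 1},
        {"gender": "M", "diagnosis": 1},
        {"gender": "M", "diagnosis": 0},
    ],
}

report = evaluate(real, candidates, group_key="gender", outcome_key="diagnosis")
for name, scores in report.items():
    print(name, scores)
```

In this toy setup, lower is better on both scores, so a goal such as "prioritize fairness" could be mapped by an orchestrating agent to choosing the candidate with the smallest parity gap. How Memisis actually translates goals into choices is not described in the abstract.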

Problem

Research questions and friction points this paper is trying to address.

synthetic data

healthcare

privacy

fairness

utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data

large language models

tabular health data