Developing synthetic microdata through machine learning for firm-level business surveys

📅 2025-12-05
🤖 AI Summary
Traditional anonymization of firm-level business survey data carries significant re-identification risk and struggles to balance confidentiality with analytical utility. To address this, the paper proposes a generative machine learning method for synthesizing microdata. The approach matches distributions across geographic and industry dimensions, applies domain-specific quality metrics, and constrains statistical moments so that the synthetic data closely reproduce key statistical properties and economic inferences of the original dataset. The authors present two synthetic public-use microdata samples for the 2007 Survey of Business Owners and replicate an econometric analysis published in *Small Business Economics*, which supports the statistical validity of the method. The work addresses a gap in the secure dissemination of business survey data and offers a framework for sharing sensitive microdata.

📝 Abstract
Public-use microdata samples (PUMS) on individuals have been available from the United States (US) Census Bureau for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents challenges distinct from those of demographic data, because businesses lack anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and its results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, which is similar to the ABS business data. Econometric replication of a high-impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.
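
The econometric replication mentioned in the abstract amounts to fitting the same model on the confidential and the synthetic file and comparing the estimates. The sketch below is only an illustration of that idea, not the authors' procedure: it fits a one-regressor OLS model on two toy datasets and scores agreement with a confidence-interval overlap measure, a common utility check in the synthetic-data literature. The variable names, the toy data-generating process, and the choice of overlap measure are all assumptions.

```python
import math
import random
import statistics


def ols_slope(xs, ys):
    """Return (slope, standard error) for a one-regressor OLS fit with intercept."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def ci_overlap(est_a, est_b, z=1.96):
    """Average share of each ~95% interval covered by the other.

    Equals 1.0 for identical intervals and is negative when they are disjoint.
    """
    (b_a, se_a), (b_b, se_b) = est_a, est_b
    lo_a, hi_a = b_a - z * se_a, b_a + z * se_a
    lo_b, hi_b = b_b - z * se_b, b_b + z * se_b
    shared = min(hi_a, hi_b) - max(lo_a, lo_b)
    return 0.5 * (shared / (hi_a - lo_a) + shared / (hi_b - lo_b))


def toy_firms(n, rng):
    """Hypothetical firm records: log employment and log receipts."""
    log_emp = [rng.gauss(2.0, 1.0) for _ in range(n)]
    log_rcpt = [5.0 + 0.8 * e + rng.gauss(0.0, 0.5) for e in log_emp]
    return log_emp, log_rcpt


# Stand-ins: "confidential" data and a "synthetic" file drawn from the same
# toy process with a different seed (a real synthesizer would replace this).
real_x, real_y = toy_firms(2000, random.Random(1))
syn_x, syn_y = toy_firms(2000, random.Random(2))

real_est = ols_slope(real_x, real_y)
syn_est = ols_slope(syn_x, syn_y)
print("real slope, se:", real_est)
print("synthetic slope, se:", syn_est)
print("CI overlap:", ci_overlap(real_est, syn_est))
```

An overlap near 1 suggests an analyst would draw the same inference from either file; values near or below 0 flag a coefficient the synthetic data fail to reproduce.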
Problem

Research questions and friction points this paper is trying to address.

Develop synthetic firm-level data to protect confidentiality
Address re-identification risks in business surveys using machine learning
Ensure synthetic data preserves statistical properties for analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine learning model generates synthetic firm-level data
Synthetic data preserves key statistical moments without real records
Quality metrics validate synthetic data against original survey
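
One basic check implied by the points above is whether a synthetic file reproduces the low-order moments of the confidential file without copying any record. The sketch below is a minimal illustration under stated assumptions, not the paper's model: the "synthesizer" is a bivariate normal fitted to two toy variables, and the names `log_emp` and `log_rcpt` are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation computed from centered sums."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def moments(xs, ys):
    """First and second moments used here as the quality metric."""
    return {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "sd_x": statistics.pstdev(xs),
        "sd_y": statistics.pstdev(ys),
        "corr": pearson(xs, ys),
    }


def synthesize(xs, ys, n, rng):
    """Draw n new (x, y) pairs from a bivariate normal fitted to the inputs.

    This is a deliberately simple stand-in for a generative model.
    """
    m = moments(xs, ys)
    rho = m["corr"]
    out_x, out_y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out_x.append(m["mean_x"] + m["sd_x"] * z1)
        # Mix z1 and z2 so the pair has correlation rho.
        out_y.append(m["mean_y"] + m["sd_y"] * (rho * z1 + math.sqrt(1 - rho**2) * z2))
    return out_x, out_y


rng = random.Random(0)
# Toy "confidential" data: log employment and log receipts per firm.
log_emp = [rng.gauss(2.0, 1.0) for _ in range(5000)]
log_rcpt = [5.0 + 0.8 * e + rng.gauss(0.0, 0.5) for e in log_emp]

syn_emp, syn_rcpt = synthesize(log_emp, log_rcpt, 5000, rng)

real_m = moments(log_emp, log_rcpt)
syn_m = moments(syn_emp, syn_rcpt)
gaps = {k: abs(real_m[k] - syn_m[k]) for k in real_m}
max_gap = max(gaps.values())

# No synthetic record should coincide with a real one.
copied = set(zip(log_emp, log_rcpt)) & set(zip(syn_emp, syn_rcpt))

print("largest moment gap:", max_gap)
print("records copied from the real file:", len(copied))
```

Matching means, standard deviations, and one correlation is a much weaker target than the geographic and industry detail a business-survey PUMS would need, so this should be read only as the shape of a moment-based check.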
Authors

Jorge Cisneros Paz
Oak Ridge Institute for Science & Education, USA

Timothy Wojan
Oak Ridge Institute for Science & Education, USA

Matthew Williams
RTI International
Bayesian statistics, privacy, optimization, survey methodology

Jennifer Ozawa
RTI International, USA

Robert Chew
RTI International, USA

Kimberly Janda
RTI International, USA

Timothy Navarro
RTI International, USA

Michael Floyd
Knexus Research Corporation, USA

Christine Task
Knexus Research Corporation, USA

Damon Streat
Knexus Research Corporation, USA