Developing synthetic microdata through machine learning for firm-level business surveys

📅 2025-12-05

🤖 AI Summary

Traditional anonymization of firm-level business survey data carries a significant re-identification risk and struggles to balance confidentiality with analytical utility. To address this, the paper describes a machine learning model for generating synthetic microdata that preserves critical moments of the original data without containing any real respondent's record. Firm data pose particular difficulties because certain industries are easily identified within a geographic area, so the authors discuss quality metrics for judging the synthetic output. Because the synthetic file for the Annual Business Survey (ABS) is still being refined and its results are confidential, they present two synthetic public-use files built from the 2007 Survey of Business Owners and replicate an econometric analysis published in *Small Business Economics*, showing that the synthetic data yield results close to those from the true data. The work addresses a gap in the safe dissemination of business survey microdata and motivates discussion of possible ABS use cases.

📝 Abstract

Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents unique challenges different from demographic data, because there is a lack of anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, similar to the ABS business data. Econometric replication of a high-impact analysis published in *Small Business Economics* demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.

Problem

Research questions and friction points this paper is trying to address.

Develop synthetic firm-level data to protect confidentiality

Address re-identification risks in business surveys using machine learning

Ensure synthetic data preserves statistical properties for analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine learning model generates synthetic firm-level data

Synthetic data preserves key statistical moments without real records

Quality metrics validate synthetic data against original survey
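
To make the idea of moment preservation concrete, here is a minimal Python sketch. It is not the paper's model or its quality metrics: the two variables, the toy "original" data, the Gaussian synthesizer, and the `moment_gaps` helper are all assumptions chosen for illustration. It fits means and covariances to a stand-in dataset, draws fresh records from the fitted distribution, and reports how far the synthetic moments drift from the originals.

```python
import numpy as np

def moment_gaps(original, synthetic):
    """Return absolute gaps in column means and in the correlation matrices."""
    mean_gap = np.abs(original.mean(axis=0) - synthetic.mean(axis=0))
    corr_gap = np.abs(
        np.corrcoef(original, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    )
    return mean_gap, corr_gap

rng = np.random.default_rng(0)

# Toy stand-in for confidential data (hypothetical): log receipts and
# log employment for 5,000 made-up firms.
true_mean = np.array([12.0, 2.0])
true_cov = np.array([[1.0, 0.6], [0.6, 0.8]])
original = rng.multivariate_normal(true_mean, true_cov, size=5000)

# Deliberately simple synthesizer (an assumption, not the paper's method):
# fit first and second moments, then draw entirely new records.
fitted_mean = original.mean(axis=0)
fitted_cov = np.cov(original, rowvar=False)
synthetic = rng.multivariate_normal(fitted_mean, fitted_cov, size=5000)

mean_gap, corr_gap = moment_gaps(original, synthetic)
print("max mean gap:", mean_gap.max())
print("max correlation gap:", corr_gap.max())
```

Small gaps indicate that the synthetic records reproduce the first and second moments even though no synthetic row is copied from the original. A realistic check would also compare results within industry and geography cells and compare regression estimates, as the abstract's replication exercise suggests.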
