GLM Inference with AI-Generated Synthetic Data Using Misspecified Linear Regression

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the degradation of estimator convergence rates, and the resulting failure of standard √n-consistency, in statistical inference for generalized linear models (GLMs) fitted to synthetic data. To resolve this, the authors propose an inference framework that requires only summary statistics of the original data, without access to individual-level records. The method introduces a misspecified linear regression to correct synthetic-data-induced bias and incorporates an asymptotic variance adjustment. The authors provide a rigorous theoretical analysis proving that the resulting estimator achieves √n-consistency and asymptotic normality, which they state is the first such result for GLMs under synthetic data. Empirical evaluation on canonical GLMs, including logistic regression, shows that the approach improves confidence interval coverage by over 40% compared to existing methods, substantially enhancing inferential reliability.

📝 Abstract
Privacy concerns in data analysis have led to the growing interest in synthetic data, which strives to preserve the statistical properties of the original dataset while ensuring privacy by excluding real records. Recent advances in deep neural networks and generative artificial intelligence have facilitated the generation of synthetic data. However, although prediction with synthetic data has been the focus of recent research, statistical inference with synthetic data remains underdeveloped. In particular, in many settings, including generalized linear models (GLMs), the estimator obtained using synthetic data converges much more slowly than in standard settings. To address these limitations, we propose a method that leverages summary statistics from the original data. Using a misspecified linear regression estimator, we then develop inference that greatly improves the convergence rate and restores the standard root-$n$ behavior for GLMs.
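The abstract's claim that inference can proceed from summary statistics alone has a concrete basis for the linear-regression step: an ordinary least squares fit depends on the data only through the cross-product matrices XᵀX and Xᵀy, so those two summaries suffice without any individual-level records. A minimal NumPy sketch of this general fact (illustrative only, not the paper's actual estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta = np.array([0.5, -1.0, 2.0])
y = X @ beta + rng.normal(size=n)

# OLS fit using the full individual-level data.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# The identical fit recovered from summary statistics alone:
# X'X is p x p and X'y is length p, regardless of sample size n.
XtX = X.T @ X
Xty = X.T @ y
beta_summary = np.linalg.solve(XtX, Xty)

assert np.allclose(beta_full, beta_summary)
```

The summaries are of fixed dimension in p, so a data holder can release them without sharing any single record, which is the kind of access the proposed framework assumes.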
Problem

Research questions and friction points this paper is trying to address.

Address slow convergence of GLM estimators with synthetic data
Improve statistical inference using AI-generated synthetic datasets
Restore standard root-n behavior via misspecified linear regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses AI-generated synthetic data for GLM inference
Employs misspecified linear regression estimator
Improves convergence rate with summary statistics
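The "misspecified linear regression" ingredient rests on a classical fact: OLS applied to a misspecified model, such as a linear fit to binary outcomes generated by a logistic GLM, still converges at the root-n rate to its population linear projection, and valid standard errors come from the heteroskedasticity-robust (sandwich) formula. A toy sketch of that underlying fact, not the paper's bias-correction procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)

# Binary outcomes from a logistic model: a linear model for y is misspecified.
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 1.2 * x)))
y = rng.binomial(1, p_true)

# OLS of y on (1, x): estimates the best linear projection, not the GLM.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Sandwich variance, valid under misspecification (White, 1980).
resid = y - X @ beta_hat
bread = np.linalg.inv(X.T @ X)
meat = (X * (resid**2)[:, None]).T @ X
cov = bread @ meat @ bread
se = np.sqrt(np.diag(cov))
```

The paper's contribution is to exploit this robustness, together with original-data summary statistics, so that the synthetic-data GLM estimator regains √n behavior; the sketch above only shows why a deliberately misspecified linear fit is a well-behaved object to build on.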