🤖 AI Summary
In supervised classification, encoder-only models (e.g., BERT, RoBERTa) degrade sharply when some classes have too few training examples to be adequately represented.
Method: The paper proposes "synthetic imputation": a generative LLM (GPT-4o) is carefully prompted with five original examples per class, drawn randomly with replacement, to generate synthetic training texts. The prompting is designed so that the synthetic texts differ lexically from the originals (reducing overfitting) while retaining their substantive meaning (maximizing out-of-sample performance).
Contribution/Results: With 75 or more original examples per class, models trained with synthetic imputation perform on par with a full sample of original texts; with only 50, overfitting remains low, predictable, and correctable. The approach establishes a practical role for generative LLMs in data augmentation and lets applied researchers balance their datasets for best performance.
📝 Abstract
Encoder-only Large Language Models (LLMs), such as BERT and RoBERTa, require that all categories in an annotation task be sufficiently represented in the training data for optimal performance. However, it is often difficult to find sufficient examples for all categories in a task when building a high-quality training set. In this article, I describe this problem and propose a solution, the synthetic imputation approach. Leveraging a generative LLM (GPT-4o), this approach generates synthetic texts based on careful prompting and five original examples drawn randomly with replacement from the sample. This approach ensures that new synthetic texts are sufficiently different from the original texts to reduce overfitting, but retain the underlying substantive meaning of the examples to maximize out-of-sample performance. With 75 original examples or more, synthetic imputation's performance is on par with a full sample of original texts, and overfitting remains low, predictable and correctable with 50 original samples. The synthetic imputation approach provides a novel role for generative LLMs in research and allows applied researchers to balance their datasets for best performance.
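The core mechanics described above (sample five originals with replacement, then prompt a generative LLM for a paraphrase-like synthetic text) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name `build_imputation_prompt` and the prompt wording are assumptions, and the paper's exact prompt is not reproduced here.

```python
import random

def build_imputation_prompt(examples, label, k=5, seed=None):
    """Build a generation prompt from k examples drawn with replacement.

    Illustrative sketch of the synthetic imputation setup; the real
    prompt used with GPT-4o in the paper will differ.
    """
    rng = random.Random(seed)
    # Draw with replacement, as described in the abstract.
    sampled = [rng.choice(examples) for _ in range(k)]
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(sampled))
    return (
        f"You are generating training data for the category '{label}'.\n"
        f"Here are {k} example texts:\n{numbered}\n"
        "Write one new text that keeps the substantive meaning of this "
        "category but uses wording different from the examples."
    )

# Hypothetical usage: the resulting string would be sent to GPT-4o,
# and the completion added to the minority class's training data.
prompt = build_imputation_prompt(
    ["Great service!", "Loved it.", "Fantastic staff."],
    label="positive",
    seed=0,
)
```

Sampling with replacement matters here: it lets a class with fewer than five usable originals still fill a five-example prompt, at the cost of occasional repeats within one prompt.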