🤖 AI Summary
In imbalanced classification, the scarcity of minority-class samples induces model bias and spurious correlations. Method: This paper proposes a synthetic oversampling paradigm that leverages large language models (LLMs), establishing the first theoretical framework for synthetic data in imbalanced learning. It rigorously quantifies the performance gains from synthetic oversampling, derives scaling laws linking synthetic sample size to model accuracy, and characterizes the capability boundary of Transformers for generating high-fidelity synthetic samples. Contribution/Results: Theoretically, the method provably enhances classification accuracy, robustness, and generalization. Empirically, LLM-generated samples effectively mitigate class bias and outperform conventional resampling techniques (e.g., SMOTE) across multiple benchmarks. Overall, this work delivers an interpretable, scalable, LLM-driven solution for trustworthy imbalanced learning.
📝 Abstract
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance: certain groups of data samples are significantly underrepresented, which in turn compromises the accuracy, robustness, and generalizability of the learned models. Recent work has proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and augment the observed data. In the context of imbalanced data, LLMs have been used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the role of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics of synthetic data augmentation and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of LLM-based synthetic oversampling and augmentation.
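To make the setting concrete, below is a minimal sketch of the kind of pipeline the abstract describes: oversampling the minority class with synthetic samples and comparing against a SMOTE baseline. It assumes scikit-learn and imbalanced-learn are available; the `llm_generate_minority_rows` helper is hypothetical and stands in for an actual LLM prompting-and-parsing step, which the paper does not specify here.

```python
# Sketch: synthetic oversampling for imbalanced binary classification,
# comparing a SMOTE baseline with an LLM-style synthetic sampler.
# `llm_generate_minority_rows` is a hypothetical stand-in: a real
# implementation would serialize minority-class rows into text, prompt
# an LLM for new rows, and parse the completions back into features.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # conventional resampling baseline


def llm_generate_minority_rows(X_min, n_new, rng):
    """Hypothetical placeholder for an LLM-based sampler.

    Here we draw from a Gaussian fit to the minority class purely so the
    sketch runs end to end; this is NOT the paper's generator.
    """
    mu = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_new)


rng = np.random.default_rng(0)

# Imbalanced toy data: ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: SMOTE interpolates between existing minority samples.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# LLM-style oversampling: synthesize enough minority rows to balance classes.
X_min = X_tr[y_tr == 1]
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = llm_generate_minority_rows(X_min, n_new, rng)
X_llm = np.vstack([X_tr, X_syn])
y_llm = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

# Train the same classifier on each augmented set and compare.
for name, (Xa, ya) in {"SMOTE": (X_sm, y_sm), "LLM-style": (X_llm, y_llm)}.items():
    clf = LogisticRegression(max_iter=1000).fit(Xa, ya)
    print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))
```

The Gaussian stand-in exists only so the sketch executes; swapping in a real LLM sampler changes only that one function, which is the point of the pipeline's design.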