🤖 AI Summary
In imbalanced classification, the scarcity of minority-class samples induces model bias and spurious correlations. Method: This paper proposes a synthetic oversampling paradigm that leverages large language models (LLMs), establishing the first theoretical framework for synthetic data in imbalanced learning. It rigorously quantifies the performance gains from synthetic oversampling, derives scaling laws linking synthetic sample size to model accuracy, and characterizes the capability boundary of Transformers for generating high-fidelity synthetic samples. Contribution/Results: Theoretically, the method provably enhances classification accuracy, robustness, and generalization. Empirically, LLM-generated samples effectively mitigate class bias and outperform conventional resampling techniques (e.g., SMOTE) across multiple benchmarks. Overall, this work delivers an interpretable, scalable, LLM-driven solution for trustworthy imbalanced learning.
📝 Abstract
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues are linked to data imbalance: certain groups of data samples are significantly underrepresented, which in turn compromises the accuracy, robustness, and generalizability of the learned models. Recent work has proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and augment the observed data. In the context of imbalanced data, LLMs have been used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the role of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics of synthetic data augmentation and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of LLM-based synthetic oversampling and augmentation.
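To make the setting concrete, below is a minimal sketch of the kind of pipeline the abstract describes: oversampling the minority class with synthetic samples and comparing against a SMOTE baseline. It assumes scikit-learn and imbalanced-learn are available; the `llm_generate_minority_rows` helper is hypothetical and stands in for an actual LLM prompting-and-parsing step, which the paper does not specify here.

```python
# Sketch: synthetic oversampling for imbalanced binary classification,
# comparing a SMOTE baseline with an LLM-style synthetic sampler.
# `llm_generate_minority_rows` is a hypothetical stand-in: a real
# implementation would serialize minority-class rows into text, prompt
# an LLM for new rows, and parse the completions back into features.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # conventional resampling baseline


def llm_generate_minority_rows(X_min, n_new, rng):
    """Hypothetical placeholder for an LLM-based sampler.

    Here we draw from a Gaussian fit to the minority class purely so the
    sketch runs end to end; this is NOT the paper's generator.
    """
    mu = X_min.mean(axis=0)
    cov = np.cov(X_min, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_new)


rng = np.random.default_rng(0)

# Imbalanced toy data: ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: SMOTE interpolates between existing minority samples.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# LLM-style oversampling: synthesize enough minority rows to balance classes.
X_min = X_tr[y_tr == 1]
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = llm_generate_minority_rows(X_min, n_new, rng)
X_llm = np.vstack([X_tr, X_syn])
y_llm = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

# Train the same classifier on each augmented set and compare.
for name, (Xa, ya) in {"SMOTE": (X_sm, y_sm), "LLM-style": (X_llm, y_llm)}.items():
    clf = LogisticRegression(max_iter=1000).fit(Xa, ya)
    print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))
```

The Gaussian stand-in exists only so the sketch executes; swapping in a real LLM sampler changes only that one function, which is the point of the pipeline's design.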