Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration

📅 2026-04-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the degradation of downstream classification performance caused by class imbalance, particularly the scarcity of rare-class samples, in real-world relational tabular data. The authors propose RDDG (Relational Data generator with Dynamic Guidance), a framework that adds a self-reinforcing feedback mechanism to large language model–driven tabular data generation. RDDG combines core-set selection, Bayesian calibration, and in-context learning to uncover attribute dependencies, and uses a progressive chain-of-thought strategy to generate synthetic data that preserves them. Experiments on multiple real-world and synthetic datasets show that RDDG outperforms existing methods in both data fidelity and downstream classification performance on imbalanced data.

📝 Abstract
Imbalanced data is common in real-world applications. While data synthesis can effectively mitigate the scarcity of rare-class samples, and LLMs have revolutionized text generation, the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs toward continuously improving the quality of the generated data throughout the synthesis process. In this work, we propose RDDG (Relational Data generator with Dynamic Guidance), a unified in-context learning framework that employs progressive chain-of-thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core-set selection to identify representative samples from the original data, then uses in-context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data that preserves those patterns and correlations. More importantly, it incorporates a self-reinforcing feedback mechanism that automatically assesses the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at https://github.com/cszhangLMU/RDDG.
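
The abstract does not say which core-set selection method RDDG uses. As a minimal sketch of the general idea (picking a small, representative subset of rare-class rows), the Python below uses greedy k-center selection. This is an assumption swapped in for illustration, not the authors' implementation. The function name `k_center_greedy`, the Euclidean distance, and the toy `rare_rows` data are all hypothetical, and rows are assumed to be already encoded as numeric tuples.

```python
import math


def k_center_greedy(rows, k):
    """Return indices of up to k rows chosen by greedy k-center selection.

    Illustrative stand-in for "core-set selection"; the method actually
    used by RDDG is not specified in the abstract.
    """
    if not rows or k <= 0:
        return []
    k = min(k, len(rows))

    # Seed with the first row (arbitrary but deterministic choice).
    selected = [0]
    # dist[i] = distance from row i to its nearest selected row so far.
    dist = [math.dist(row, rows[0]) for row in rows]

    while len(selected) < k:
        # The next center is the row farthest from the current core set.
        nxt = max(range(len(rows)), key=dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-center distances with the new center.
        dist = [min(d, math.dist(row, rows[nxt])) for d, row in zip(dist, rows)]

    return selected


# Hypothetical rare-class rows, already encoded as numeric features.
rare_rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
core_indices = k_center_greedy(rare_rows, 3)
print(core_indices)  # [0, 4, 2]
```

In a pipeline like the one described, the selected rows would serve as the in-context examples shown to the LLM. Tables with mixed categorical and numeric attributes would first need an encoding, or a different distance measure.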
Problem

Research questions and friction points this paper is trying to address.

imbalanced data
relational data synthesis
rare-class generation
tabular data
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

relational data synthesis
in-context learning
self-reinforcing feedback
imbalanced classification
chain-of-thought