TAGAL: Tabular Data Generation using Agentic LLM Methods

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-quality synthetic tabular data, without fine-tuning large language models (LLMs), to enhance downstream classification performance. We propose a zero-shot, agent-based generation framework that orchestrates a workflow integrating external knowledge injection, data similarity assessment, and downstream utility feedback to enable iterative refinement of the synthesized data. Our key contribution is the first unification of feedback-driven zero-shot generation, knowledge augmentation, and task-oriented validation within a single agent architecture, entirely avoiding LLM parameter updates. Empirical evaluation across multiple benchmark datasets shows that classifiers trained on synthetic data alone, or on hybrid real-synthetic data, achieve performance on par with state-of-the-art fine-tuning methods while significantly outperforming existing training-free approaches. Crucially, our method preserves the statistical fidelity of the synthetic data while ensuring its task relevance and utility.

📝 Abstract
The generation of synthetic data is a common approach to improving the performance of machine learning tasks, among them the training of classification models. In this paper, we present TAGAL, a collection of methods for generating synthetic tabular data using an agentic workflow. The methods leverage Large Language Models (LLMs) in an automatic, iterative process that uses feedback to improve the generated data without any further LLM training. The use of LLMs also allows external knowledge to be incorporated into the generation process. We evaluate TAGAL across diverse datasets and different aspects of quality for the generated data. We assess the utility of downstream ML models, both by training classifiers on synthetic data alone and by combining real and synthetic data. Moreover, we compare the similarity between the real and the generated data. We show that TAGAL performs on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches. These findings highlight the potential of agentic workflows and open new directions for LLM-based data generation methods.
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic tabular data using agentic LLM methods
Improving machine learning performance without LLM training
Enhancing data quality with automatic iterative feedback processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic workflow for synthetic tabular data generation
LLM-based iterative process with feedback improvement
Training-free approach using external knowledge integration
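The page does not reproduce TAGAL's actual pseudocode. Purely as an illustration of the iterative feedback idea described above, a minimal sketch might look like the following, where `generate` (the LLM call) and `evaluate` (the downstream-utility scorer) are assumed interfaces and the mock implementations are hypothetical stand-ins, not the paper's method:

```python
import random

def agentic_generate(generate, evaluate, rounds=3, n_rows=50):
    """Iteratively generate synthetic rows, score each batch with a
    downstream evaluator, and fold the score back into the next prompt
    as natural-language feedback. Returns the best-scoring batch."""
    feedback = ""
    best_rows, best_score = [], float("-inf")
    for _ in range(rounds):
        prompt = f"Generate {n_rows} tabular rows matching the real schema. {feedback}"
        rows = generate(prompt, n_rows)   # agent action: LLM generation call
        score = evaluate(rows)            # agent observation: utility signal
        if score > best_score:
            best_rows, best_score = rows, score
        feedback = (f"The previous batch scored {score:.2f} on downstream "
                    f"utility; improve class balance and feature realism.")
    return best_rows, best_score

# Hypothetical stand-ins: a real system would call an LLM here and train a
# classifier on the generated rows to measure utility.
def mock_generate(prompt, n_rows):
    rng = random.Random(len(prompt))      # deterministic stub, no LLM involved
    return [(rng.gauss(0.0, 1.0), rng.randint(0, 1)) for _ in range(n_rows)]

def mock_evaluate(rows):
    labels = [label for _, label in rows]
    # Proxy utility: how close the label distribution is to 50/50.
    return 1.0 - abs(sum(labels) / len(labels) - 0.5)
```

The key design point, per the abstract, is that refinement happens entirely through the prompt-and-feedback loop: no LLM parameters are updated at any stage.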