🤖 AI Summary
Large language models (LLMs) generate lower-quality synthetic tabular data when real-world column names carry little semantic meaning. To address this, the paper proposes a domain-knowledge-enhanced prompt engineering approach for the GReaT framework. It designs three prompt construction protocols (Expert-guided, LLM-guided, and Novel-Mapping), each of which explicitly injects domain knowledge into the generative process. An empirical comparison of these protocols shows that semantically enriched prompts improve synthetic data quality, including column-wise distribution fidelity and row-level logical coherence, and also make training more efficient: convergence is faster, with the number of required iterations reduced by over 30%. The work provides a reproducible methodology and an empirical basis for prompt optimization in structured data generation, and supports integrating domain semantics into LLM-based tabular synthesis.
📝 Abstract
LLM-based data generation for real-world tabular data can be challenged by the lack of sufficient semantic context in feature names used to describe columns. We hypothesize that enriching prompts with domain-specific insights can improve both the quality and efficiency of data generation. To test this hypothesis, we explore three prompt construction protocols: Expert-guided, LLM-guided, and Novel-Mapping. Through empirical studies with the recently proposed GReaT framework, we find that context-enriched prompts lead to significantly improved data generation quality and training efficiency.
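As a rough illustration of what "enriching prompts" can mean in practice, the sketch below assumes a GReaT-style text encoding in which each table row is serialized into "feature is value" clauses, and swaps uninformative column names for descriptive ones before serialization. The function `serialize_row`, the column names `V1` to `V3`, and the descriptions in `expert_map` are hypothetical and chosen only for this example; they are not taken from the paper or from any library API.

```python
import random

def serialize_row(row, name_map=None, rng=None):
    """Serialize one table row as comma-separated 'feature is value' clauses.

    name_map (hypothetical) maps raw column names to richer descriptions;
    columns missing from the map keep their original name.
    rng, if given, shuffles clause order (a random feature-order permutation).
    """
    name_map = name_map or {}
    items = [(name_map.get(col, col), val) for col, val in row.items()]
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{name} is {val}" for name, val in items)

# Hypothetical row with semantically empty column names.
row = {"V1": 63, "V2": 1, "V3": 145}

# Hypothetical descriptions, standing in for what a domain expert
# (or an LLM asked to describe the columns) might supply.
expert_map = {
    "V1": "Age in years",
    "V2": "Sex (1 = male)",
    "V3": "Resting blood pressure (mm Hg)",
}

baseline = serialize_row(row)
enriched = serialize_row(row, expert_map)
shuffled = serialize_row(row, expert_map, rng=random.Random(0))

print(baseline)  # V1 is 63, V2 is 1, V3 is 145
print(enriched)  # Age in years is 63, Sex (1 = male) is 1, Resting blood pressure (mm Hg) is 145
```

Only the text seen by the language model changes here; the underlying values are untouched, so generated rows can be mapped back to the original column names by inverting `expert_map`.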