🤖 AI Summary
Synthetic natural language descriptions generated by large language models (LLMs) are increasingly used to train spreadsheet formula generation models, yet their annotation quality—and its impact on downstream fine-tuning performance—remains poorly understood.
Method: We propose a proxy-objective-based synthetic data validation framework that integrates multi-model comparison (two open-source and two closed-source LLMs) with rigorous formula–natural language alignment evaluation.
Contribution/Results: Through systematic empirical analysis, we demonstrate for the first time that synthetic annotation quality critically influences fine-tuning efficacy. While high-quality sample filtering reduces dataset size, it unexpectedly enhances model generalization and reasoning capabilities on complex formulas. Our approach consistently improves fine-tuning performance across four state-of-the-art models, confirming that high-fidelity synthetic data is a key lever for improving robustness in formula generation systems.
📝 Abstract
Large language models (LLMs) can be leveraged to help write formulas in spreadsheets, but training resources for these formulas are scarce, which both limits the base performance of pre-trained models and restricts the ability to fine-tune them. Given a corpus of formulas, we can use a(nother) model to generate synthetic natural language (NL) utterances for fine-tuning. However, it is important to validate that the generated NL accurately describes each formula before using it for fine-tuning. In this paper, we provide empirical results on the impact of validating these synthetic training examples with surrogate objectives that evaluate the accuracy of the synthetic annotations. We demonstrate that validation improves performance over raw data across four models (two open-weight and two closed-weight). Interestingly, although validation tends to prune more challenging examples, it increases the complexity of problems that models can solve after being fine-tuned on validated data.
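To make the validation idea concrete, here is a minimal sketch of one plausible surrogate objective: round-trip consistency, where a model regenerates the formula from the synthetic NL description and the pair is kept only if the regenerated formula matches the original after normalization. The paper does not specify this particular objective; `regenerate_formula` below is a hypothetical stand-in for an LLM call, implemented as a trivial lookup purely for illustration.

```python
def normalize(formula: str) -> str:
    """Case-fold and strip whitespace so superficial differences don't count."""
    return "".join(formula.split()).upper()


def regenerate_formula(nl: str) -> str:
    # Hypothetical stand-in for an NL-to-formula model call.
    # A real pipeline would query an LLM here.
    demo_model = {
        "sum cells a1 through a10": "=SUM(A1:A10)",
        "average of column b": "=AVG(B:B)",  # simulated model error: AVG vs AVERAGE
    }
    return demo_model.get(nl.lower(), "")


def validate_pairs(pairs):
    """Keep only (formula, nl) pairs that survive the round-trip check."""
    kept = []
    for formula, nl in pairs:
        if normalize(regenerate_formula(nl)) == normalize(formula):
            kept.append((formula, nl))
    return kept


synthetic = [
    ("=SUM(A1:A10)", "Sum cells A1 through A10"),
    ("=AVERAGE(B:B)", "Average of column B"),
]
print(validate_pairs(synthetic))  # keeps only the first pair; the second fails the round trip
```

As the abstract notes, such filtering shrinks the dataset (the second pair above is pruned even though the annotation may be partly useful), yet fine-tuning on the surviving high-fidelity pairs is what improves downstream performance.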