🤖 AI Summary
Synthetic natural language descriptions generated by large language models (LLMs) are increasingly used to train spreadsheet formula generation models, yet their annotation quality—and its impact on downstream fine-tuning performance—remains poorly understood.
Method: We propose a proxy-objective-based synthetic data validation framework that integrates multi-model comparison (two open-source and two closed-source LLMs) with rigorous formula–natural language alignment evaluation.
Contribution/Results: Through systematic empirical analysis, we demonstrate for the first time that synthetic annotation quality critically influences fine-tuning efficacy. While high-quality sample filtering reduces dataset size, it unexpectedly enhances model generalization and reasoning capabilities on complex formulas. Our approach consistently improves fine-tuning performance across four state-of-the-art models, confirming that high-fidelity synthetic data is a key lever for improving robustness in formula generation systems.
📝 Abstract
Large language models (LLMs) can be leveraged to help write formulas in spreadsheets, but training resources for these formulas are scarce, which both limits the base performance of pre-trained models and restricts the ability to fine-tune them. Given a corpus of formulas, we can use a(nother) model to generate synthetic natural language (NL) utterances for fine-tuning. However, it is important to validate that the generated NL accurately describes each formula before using it for fine-tuning. In this paper, we provide empirical results on the impact of validating these synthetic training examples with surrogate objectives that evaluate the accuracy of the synthetic annotations. We demonstrate that validation improves performance over raw data across four models (two open-weight and two closed-weight). Interestingly, although validation tends to prune more challenging examples, it increases the complexity of problems that models can solve after being fine-tuned on validated data.
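To make the validation idea concrete, here is a minimal sketch of one plausible surrogate objective: round-trip consistency, where a model regenerates the formula from the synthetic NL description and the pair is kept only if the regenerated formula matches the original after normalization. The paper does not specify this particular objective; `regenerate_formula` below is a hypothetical stand-in for an LLM call, implemented as a trivial lookup purely for illustration.

```python
def normalize(formula: str) -> str:
    """Case-fold and strip whitespace so superficial differences don't count."""
    return "".join(formula.split()).upper()


def regenerate_formula(nl: str) -> str:
    # Hypothetical stand-in for an NL-to-formula model call.
    # A real pipeline would query an LLM here.
    demo_model = {
        "sum cells a1 through a10": "=SUM(A1:A10)",
        "average of column b": "=AVG(B:B)",  # simulated model error: AVG vs AVERAGE
    }
    return demo_model.get(nl.lower(), "")


def validate_pairs(pairs):
    """Keep only (formula, nl) pairs that survive the round-trip check."""
    kept = []
    for formula, nl in pairs:
        if normalize(regenerate_formula(nl)) == normalize(formula):
            kept.append((formula, nl))
    return kept


synthetic = [
    ("=SUM(A1:A10)", "Sum cells A1 through A10"),
    ("=AVERAGE(B:B)", "Average of column B"),
]
print(validate_pairs(synthetic))  # keeps only the first pair; the second fails the round trip
```

As the abstract notes, such filtering shrinks the dataset (the second pair above is pruned even though the annotation may be partly useful), yet fine-tuning on the surviving high-fidelity pairs is what improves downstream performance.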