🤖 AI Summary
The scarcity of supervised data for low-resource programming languages, such as Excel formulas, severely limits the code-generation capabilities of large language models.
Method: We propose a teacher-model-based approach that synthesizes textbook-grade function–formula pairs enriched with natural-language explanations. Using a teacher–student knowledge-distillation framework, the method automatically generates high-quality, diverse, and semantically grounded synthetic examples in place of scarce real-world code–comment pairs. These structured synthetic data are then integrated via instruction tuning to improve student-model performance, without relying on external retrieval as conventional RAG pipelines do.
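The pipeline described above, querying a teacher model for textbook-style demonstrations and converting them into instruction-tuning records, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the prompt wording, record schema, and the example VLOOKUP content are all invented here for clarity.

```python
import json

# Hypothetical synthesis step: for each target Excel function, a teacher
# model is prompted for a textbook-quality demonstration (one formula plus
# a natural-language explanation). The teacher's reply is then packed into
# a JSON-lines instruction-tuning example for the student model.

TEACHER_PROMPT = (
    "Write a textbook-quality demonstration of the Excel function {fn}. "
    "Give one realistic formula using {fn} and a short explanation of "
    "what each argument does and what the formula returns."
)

def make_teacher_prompt(function_name: str) -> str:
    """Build the synthesis prompt sent to the teacher model."""
    return TEACHER_PROMPT.format(fn=function_name)

def to_instruction_record(function_name: str, formula: str, explanation: str) -> str:
    """Pack one synthetic (function, formula, explanation) triple into a
    JSON-lines record for instruction tuning the student model."""
    return json.dumps({
        "instruction": f"Explain how to use the Excel function {function_name}.",
        "output": f"{formula}\n{explanation}",
    })

# Example record, with teacher output stubbed in by hand for illustration.
record = to_instruction_record(
    "VLOOKUP",
    '=VLOOKUP("apple", A2:B10, 2, FALSE)',
    'Looks up the exact text "apple" in column A of the range A2:B10 and '
    "returns the matching value from the second column (column B).",
)
```

In a real run, `to_instruction_record` would consume parsed teacher-model responses over a whole library of functions, and the resulting JSON-lines file would feed a standard instruction-tuning loop.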
Contribution/Results: Experiments on two Excel-specific question-answering benchmarks show that the method significantly outperforms both baseline models and standard RAG approaches, demonstrating that synthetically constructed textbook-style data are an effective way to model low-resource programming languages.
📝 Abstract
A key consideration when training an LLM is how well resourced the target language is, whether that is English compared with Welsh, or Python compared with Excel. Typical training data for programming languages consist of real program demonstrations paired with human-written comments. Here we present novel approaches to creating such data for low-resource programming languages. We generate fully synthetic, textbook-quality demonstrations of common library functions in an example domain of Excel formulas, using a teacher model. We then finetune an underperforming student model and show improvement on two question-answering datasets recast into the Excel domain. We also show the advantages of finetuning over standard, off-the-shelf RAG approaches, which offer only modest improvement because the target domain is unfamiliar.