🤖 AI Summary
The scarcity of supervised data for low-resource programming languages, such as Excel formulas, severely limits the code-generation capabilities of large language models.
Method: We propose a teacher-model-based approach that synthesizes textbook-grade function–formula pairs enriched with natural-language explanations. Using a teacher–student knowledge-distillation framework, the method automatically generates high-quality, diverse, and semantically grounded synthetic examples in place of scarce real-world code–comment pairs. These structured synthetic data are then integrated via instruction tuning to improve student-model performance, without relying on external retrieval as conventional RAG pipelines do.
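The pipeline described above, querying a teacher model for textbook-style demonstrations and converting them into instruction-tuning records, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the prompt wording, record schema, and the example VLOOKUP content are all invented here for clarity.

```python
import json

# Hypothetical synthesis step: for each target Excel function, a teacher
# model is prompted for a textbook-quality demonstration (one formula plus
# a natural-language explanation). The teacher's reply is then packed into
# a JSON-lines instruction-tuning example for the student model.

TEACHER_PROMPT = (
    "Write a textbook-quality demonstration of the Excel function {fn}. "
    "Give one realistic formula using {fn} and a short explanation of "
    "what each argument does and what the formula returns."
)

def make_teacher_prompt(function_name: str) -> str:
    """Build the synthesis prompt sent to the teacher model."""
    return TEACHER_PROMPT.format(fn=function_name)

def to_instruction_record(function_name: str, formula: str, explanation: str) -> str:
    """Pack one synthetic (function, formula, explanation) triple into a
    JSON-lines record for instruction tuning the student model."""
    return json.dumps({
        "instruction": f"Explain how to use the Excel function {function_name}.",
        "output": f"{formula}\n{explanation}",
    })

# Example record, with teacher output stubbed in by hand for illustration.
record = to_instruction_record(
    "VLOOKUP",
    '=VLOOKUP("apple", A2:B10, 2, FALSE)',
    'Looks up the exact text "apple" in column A of the range A2:B10 and '
    "returns the matching value from the second column (column B).",
)
```

In a real run, `to_instruction_record` would consume parsed teacher-model responses over a whole library of functions, and the resulting JSON-lines file would feed a standard instruction-tuning loop.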
Contribution/Results: Experiments on two Excel-specific question-answering benchmarks show that the method significantly outperforms both baseline models and standard RAG approaches, demonstrating that synthetically constructed textbook-style data are an effective way to model low-resource programming languages.
📝 Abstract
A key consideration when training an LLM is how well resourced the target language is, whether that is English compared with Welsh, or Python compared with Excel. Typical training data for programming languages consist of real program demonstrations paired with human-written comments. Here we present novel approaches to creating such data for low-resource programming languages. We generate fully synthetic, textbook-quality demonstrations of common library functions in an example domain of Excel formulas, using a teacher model. We then finetune an underperforming student model and show improvement on two question-answering datasets recast into the Excel domain. We also show the advantages of finetuning over standard, off-the-shelf RAG approaches, which offer only modest improvement because the target domain is unfamiliar.