🤖 AI Summary
Large language models (LLMs) for code underperform on practical software engineering tasks such as unit test generation, primarily because their training data skews toward high-frequency tasks like code completion and high-quality, task-aligned data is scarce, especially for low-resource languages like Go.
Method: We introduce GO UT Bench, the first dedicated benchmark for Go unit test generation, comprising 5,264 real-world code–test pairs drawn from 10 permissively licensed open-source projects. Leveraging this benchmark, we systematically evaluate and fine-tune two architectures, mixture-of-experts and dense decoders, on the unit test generation task.
Contribution/Results: Our work is the first to simultaneously address the evaluation and data gaps for unit test generation in low-resource programming languages. Fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks. By mitigating data imbalance and improving task alignment, our approach substantially improves model practicality and generalization in real-world development scenarios.
📝 Abstract
Training data imbalance poses a major challenge for code LLMs. Most available data heavily overrepresents raw open-source code while underrepresenting broader software engineering tasks, especially in low-resource languages like Golang. As a result, models excel at code autocompletion but struggle with real-world developer workflows such as unit test generation. To address this gap, we introduce GO UT Bench, a benchmark dataset of 5,264 pairs of code and unit tests drawn from 10 permissively licensed Golang repositories spanning diverse domains. We evaluate its effectiveness as a fine-tuning dataset across two LLM families, i.e., mixture-of-experts and dense decoders. Our results show that fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks.
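For concreteness, a code–unit-test pair of the kind such a benchmark collects might look like the following minimal sketch. The `Clamp` function and its table-driven test are hypothetical illustrations, not drawn from GO UT Bench itself:

```go
package main

import "fmt"

// Clamp restricts v to the range [lo, hi]. In a code–test pair, this
// would be the "focal" function the model must write a test for.
func Clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// The paired unit test, in idiomatic table-driven Go style, would be:
//
//	func TestClamp(t *testing.T) {
//		cases := []struct{ v, lo, hi, want int }{
//			{5, 0, 10, 5},   // in range: unchanged
//			{-3, 0, 10, 0},  // below range: clamped to lo
//			{42, 0, 10, 10}, // above range: clamped to hi
//		}
//		for _, c := range cases {
//			if got := Clamp(c.v, c.lo, c.hi); got != c.want {
//				t.Errorf("Clamp(%d, %d, %d) = %d, want %d",
//					c.v, c.lo, c.hi, got, c.want)
//			}
//		}
//	}

func main() {
	fmt.Println(Clamp(-3, 0, 10), Clamp(5, 0, 10), Clamp(42, 0, 10))
}
```

Generating the test half of such a pair, given only the focal function, is the task the benchmark evaluates.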