Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

📅 2024-07-29
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality instruction data and the prohibitive cost of manual construction in code generation, this paper proposes a tri-model co-evolutionary synthetic framework: an Instructor-LLM generates instructions, a Coder-LLM produces corresponding code, and a Judge-LLM automatically evaluates correctness; genetic operators—mutation, selection, and crossover—are applied to instructions. This work introduces the first Instructor-Coder-Judge co-evolution paradigm, enabling cold-start training with weak models and offering strong scalability and parallelism. Starting from only a small set of seed instructions, the framework efficiently synthesizes millions of high-quality instruction-code pairs. Experiments yield over 7.5 million samples; fine-tuning LLMs on this data significantly improves code generation performance, outperforming existing synthetic approaches and public datasets on benchmarks including HumanEval.
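The loop described above — parents selected from an instruction pool, mutated or crossed over by an Instructor-LLM, solved by a Coder-LLM, and filtered by a Judge-LLM — can be sketched in miniature. The functions below are toy stand-ins, not the paper's actual prompts or models; in the real framework each would be a call to an LLM.

```python
import random

def instructor_mutate(instruction):
    """'Mutation': rewrite one instruction into a harder variant (toy stand-in)."""
    return instruction + " Additionally, handle edge cases."

def instructor_crossover(a, b):
    """'Crossover': merge two parent instructions into a new task (toy stand-in)."""
    return f"Combine the following tasks: (1) {a} (2) {b}"

def coder_generate(instruction):
    """Coder-LLM stand-in: produce candidate code for an instruction."""
    return f"# solution for: {instruction}\ndef solve():\n    pass"

def judge_accept(instruction, code):
    """Judge-LLM stand-in: accept or reject the instruction-code pair."""
    return len(instruction) > 0 and "def" in code

def genetic_instruct(seeds, generations=3, offspring_per_gen=4):
    """Evolve a pool of instruction-code pairs from a few seed instructions."""
    pool = list(seeds)
    dataset = []
    for _ in range(generations):
        for _ in range(offspring_per_gen):
            # Selection: sample parents from the current pool,
            # then apply mutation or crossover.
            if len(pool) < 2 or random.random() < 0.5:
                child = instructor_mutate(random.choice(pool))
            else:
                child = instructor_crossover(*random.sample(pool, 2))
            code = coder_generate(child)
            if judge_accept(child, code):   # quality filter
                pool.append(child)          # survivor rejoins the breeding pool
                dataset.append((child, code))
    return dataset
```

Because accepted children re-enter the pool, each generation breeds from an ever-larger and harder set of instructions, which is what lets the method scale from a handful of seeds; the independent offspring within a generation are also what make the real pipeline easy to parallelize.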

📝 Abstract
Large Language Models (LLMs) require high-quality instruction data for effective alignment, particularly in code generation tasks where expert-curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. Starting from a small set of seed instructions, Genetic-Instruct generates diverse and challenging instruction-code pairs by leveraging an Instructor-LLM for generation, a Coder-LLM for code synthesis, and a Judge-LLM for automatic quality evaluation. Our proposed approach is highly parallelizable and effective even with small seed data and weaker generator models. We generated more than 7.5 million coding instructions with the proposed approach. We then evaluated it by fine-tuning LLMs on the synthetic samples and demonstrated a significant improvement in their code generation capability compared to other synthetic generation approaches and publicly available datasets. Our results highlight the efficiency, scalability, and generalizability of the Genetic-Instruct framework.
Problem

Research questions and friction points this paper is trying to address.

Scalable synthesis of high-quality coding instructions for LLMs
Reducing reliance on expensive expert-curated datasets
Improving LLM code generation via evolutionary synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary algorithm synthesizes coding instructions
Uses LLMs for generation and quality evaluation
Scalable framework with minimal seed data
Somshubra Majumdar
NVIDIA
Machine Learning, Deep Learning, Computer Vision, Time Series, Speech Recognition
V. Noroozi
NVIDIA, Santa Clara, CA 15213, USA
Sean Narenthiran
NVIDIA, Santa Clara, CA 15213, USA
Aleksander Ficek
NVIDIA, Santa Clara, CA 15213, USA
Jagadeesh Balam
NVIDIA, Santa Clara, CA 15213, USA
Boris Ginsburg
NVIDIA
Deep Learning, Speech Recognition, Speech Synthesis