Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) such as ChatGPT and Gemini incur high computational overhead and cost when deployed in programming education, and often over-intervene in student learning. Method: This study proposes a specialization pipeline for compact models targeting C-language compilation error understanding. Leveraging a high-quality, domain-specific dataset of 40,000 real student error instances, we perform supervised fine-tuning (SFT) on open-source models including Qwen3-4B and Llama-3.1-8B. Evaluation employs both expert assessment and LLM-as-judge methodologies. Results: The fine-tuned small models match large models in explanation accuracy and pedagogical appropriateness, while drastically reducing inference latency and deployment cost. Our core contribution demonstrates that lightweight models, when paired with high-fidelity domain data, can effectively power programming education tools—establishing a new paradigm for low-cost, interpretable, and controllable educational AI agents.
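The SFT pipeline described above starts from (code, compiler error, explanation) triples. A minimal sketch of how one such student error instance might be packed into the chat format expected by common fine-tuning libraries (e.g. Hugging Face TRL) is shown below; the field names and prompt wording are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch: convert one student error instance into a
# chat-format SFT example. Field names and prompt text are assumptions,
# not taken from the paper's dataset schema.

def to_sft_example(source_snippet, compiler_error, explanation):
    """Pack a (code, error, explanation) triple into a messages list."""
    prompt = (
        "A student's C program produced this compiler error.\n\n"
        f"Code:\n{source_snippet}\n\n"
        f"Error:\n{compiler_error}\n\n"
        "Explain the error so the student can fix it themselves, "
        "without giving the corrected code outright."
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": explanation},
        ]
    }

example = to_sft_example(
    source_snippet='int main() { printf("hi")\n}',
    compiler_error="error: expected ';' before '}' token",
    explanation=(
        "The compiler expected a semicolon at the end of the printf "
        "statement. In C, every statement must end with ';'. Look at "
        "the line just before the closing brace."
    ),
)
```

Keeping the assistant turn pedagogical (a hint rather than a fix) is what lets SFT counteract the over-assistance problem the summary highlights.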

📝 Abstract
Frontier large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real errors generated by introductory programming (CS1/2) students, which we used to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.
Problem

Research questions and friction points this paper is trying to address.

Replacing proprietary LLMs with fine-tuned open-source models for education
Improving pedagogical quality of small models via supervised fine-tuning
Addressing cost and over-assistance issues of large models in teaching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Fine-Tuning enhances smaller open-source LLMs
Utilizes 40,000 C compiler error dataset for training
Dual evaluation with expert reviews and automated analysis
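The automated half of the dual evaluation uses a validated LLM-as-judge ensemble. One simple way such an ensemble can be aggregated is averaging per-criterion rubric scores across judges; the sketch below illustrates this, with judge names and the two criteria (accuracy, pedagogy) as assumptions rather than the paper's actual rubric.

```python
# Illustrative sketch (not the paper's code): aggregate rubric scores
# from an LLM-as-judge ensemble by per-criterion mean. Judge names and
# criteria are assumed for the example.

from statistics import mean

def aggregate_judgments(judgments):
    """judgments: {judge_name: {criterion: score}} -> per-criterion mean."""
    criteria = {c for scores in judgments.values() for c in scores}
    return {
        c: round(mean(s[c] for s in judgments.values() if c in s), 2)
        for c in criteria
    }

judgments = {
    "judge_a": {"accuracy": 5, "pedagogy": 4},
    "judge_b": {"accuracy": 4, "pedagogy": 4},
    "judge_c": {"accuracy": 5, "pedagogy": 3},
}
scores = aggregate_judgments(judgments)
```

Averaging over several judge models dampens the idiosyncratic biases of any single judge, which is why the paper validates the ensemble against expert human reviews.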