Fine-Tuning Models for Automated Code Review Feedback

📅 2026-05-12

🤖 AI Summary
This study addresses the limitations of open-source large language models in automatically generating high-quality code review feedback, particularly in comparison to closed-source counterparts. Focusing on Java error feedback generation in programming education, it presents the first systematic comparison between parameter-efficient fine-tuning (PEFT) and prompt engineering. Leveraging the Code Llama model and a high-quality feedback dataset, the authors conduct a multidimensional evaluation using BLEU, ROUGE, BERTScore, and human assessment. Results demonstrate that PEFT substantially outperforms prompt engineering, with student evaluations indicating that the generated feedback achieves quality comparable to that of ChatGPT. These findings validate the feasibility of deploying open-source models at scale in educational programming contexts.
πŸ“ Abstract
Large Language Models have introduced new possibilities for programming education through personalized support, content creation, and automated feedback. While recent studies have demonstrated the potential for feedback generation, many techniques rely on proprietary models, raising concerns about cost, computational demands, and the ethical implications of sharing student code. Open LLMs provide an alternative approach, but they do not currently have the capabilities of proprietary models. To address this problem, we investigate whether parameter-efficient fine-tuning (PEFT) and prompt engineering, both of which distil knowledge from a dataset derived from a large, more capable model, can be used to adapt and enhance the quality of feedback generated by the open LLM Code Llama. Feedback quality on buggy Java code was assessed using a combination of student evaluation, manual annotation and the automated metrics BLEU, ROUGE, and BERTScore. Our findings indicate that PEFT leads to notable improvements in feedback quality and significantly outperforms prompt engineering, providing an avenue for developing freely deployable feedback tools that can be effectively used to guide student learning. Student evaluation indicates that learners value the PEFT model's feedback and see it as being as effective as that of the proprietary ChatGPT model. Participants suggested that the PEFT model's feedback would be more beneficial if it included additional explanation of technical terms. This study demonstrates that fine-tuned models can effectively support critical thinking and guide the design of scalable pedagogical systems.
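The automated metrics named in the abstract score generated feedback by comparing it with a reference text. As a rough illustration only, and not the authors' evaluation pipeline, the sketch below computes a simplified ROUGE-1 F1 score from unigram overlap. It assumes whitespace tokenization and lowercasing with no stemming, so its values will differ from those of a full ROUGE implementation. The function name `rouge1_f1` and both feedback strings are hypothetical, invented for this example.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between candidate and reference."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical feedback strings, not taken from the paper's dataset.
reference = "the loop index goes out of bounds on the last iteration"
generated = "the loop index goes out of bounds"
print(round(rouge1_f1(generated, reference), 3))
```

Overlap metrics of this kind reward shared wording rather than correctness, which may be one reason the study pairs them with BERTScore, manual annotation, and student evaluation.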
Problem

Research questions and friction points this paper is trying to address.

Automated Code Review
Open LLMs
Feedback Generation
Programming Education
Model Fine-Tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-Efficient Fine-Tuning
Automated Code Review
Open LLMs
Feedback Generation
Programming Education