🤖 AI Summary
Automated feedback in beginner programming education suffers from poor interpretability and limited pedagogical guidance. Method: This paper proposes a Socratic code-feedback generation framework based on Reinforcement Learning from Human Feedback (RLHF), coupling RLHF with automated code evaluation to steer large language models (Llama-3-7B and GPT-3.5) toward generating pedagogically effective questions and hints rather than direct corrections. Contribution/Results: We introduce a new competition-level benchmark tailored to programming-education evaluation and integrate two state-of-the-art optimization strategies, Proximal Policy Optimization (PPO) and Best-of-n sampling, comparing against reward-model-free RL from AI Feedback (RLAIF). Experiments show: (i) automated-evaluation accuracy improves by 2–5% over RL-free baselines; (ii) human evaluation indicates nearly 40% higher feedback quality for GPT-3.5 under Best-of-n; and (iii) the approach performs on par with or slightly above RLAIF on both basic and competition-level programming datasets.
📝 Abstract
Automated Program Repair tools generate feedback and suggest repairs for erroneous code. State-of-the-art (SOTA) code-repair methods rely on data-driven approaches and often fail to deliver solutions for complicated programming questions. Large Language Models (LLMs) are therefore crucial for code-feedback generation, since they can interpret the natural-language statements of previously unseen programming problems. LLMs generate more comprehensible feedback than compiler-generated error messages, and Reinforcement Learning from Human Feedback (RLHF) further improves quality by keeping a human in the loop, helping novice students learn programming from scratch interactively. We apply RLHF fine-tuning to elicit Socratic responses, such as a question with a hint, that guide students toward resolving the programming issue. We propose a code-feedback generation tool, Automated Code Evaluation with RLHF (ACE-RLHF), which fine-tunes LLMs with RLHF, combining two LLM models with two different SOTA optimization techniques. Feedback quality is evaluated on two benchmark datasets containing basic and competition-level programming questions, the latter of which we propose. Using Llama-3-7B with Proximal Policy Optimization, we achieve 2–5% higher accuracy in automated evaluation than RL-free SOTA techniques, and similar or slightly higher accuracy than reward-model-free RL with AI Feedback (RLAIF). With GPT-3.5 and Best-of-n optimization, we achieve almost 40% higher accuracy in manual evaluation.
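The Best-of-n strategy mentioned above can be illustrated with a minimal sketch: sample n candidate feedback messages and keep the one the reward model scores highest. Here `generate` and `reward` are hypothetical stand-ins (not the paper's actual models); in the real system, `generate` would call the fine-tuned LLM and `reward` would be a trained reward model scoring Socratic feedback quality.

```python
import random

# Hypothetical stand-in for the fine-tuned LLM: returns one candidate
# Socratic hint per sampling seed.
def generate(prompt: str, seed: int) -> str:
    rng = random.Random(seed)
    hints = [
        "What value does the loop variable hold on the final iteration?",
        "Have you checked what your function returns for an empty list?",
        "Which line updates the accumulator, and is it inside the loop?",
    ]
    return rng.choice(hints)

# Hypothetical stand-in for the reward model: a toy proxy that prefers
# question-style (Socratic) feedback over declarative corrections.
def reward(prompt: str, response: str) -> float:
    return float(response.endswith("?")) + 0.01 * len(response)

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates and return the highest-reward one."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=lambda r: reward(prompt, r))

feedback = best_of_n("Student code: sum() of a non-empty list returns 0", n=8)
print(feedback)
```

Unlike PPO, Best-of-n needs no gradient updates to the policy model, which is why it can be applied to a closed API model such as GPT-3.5: the reward model only re-ranks sampled outputs.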