PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

📅 2026-05-01

🤖 AI Summary
This work addresses the challenge of aligning large language models' feedback on programming assignments with the stylistic preferences of individual instructors while preserving diagnostic accuracy. The authors propose PERSA, a framework that fine-tunes only the top Transformer layers through a combination of supervised fine-tuning, human-preference-based reward modeling, and Proximal Policy Optimization (PPO) reinforcement learning. By restricting adaptation to style-relevant parameters, PERSA achieves strong alignment with target instructor styles without compromising content correctness. Evaluated on three benchmarks (APPS, PyFiXV, and CodeReviewQA) with Llama-3 and Gemma-2 backbones, the approach raises the Style Alignment Score on APPS from 34.8% for the base model to 96.2%, with Correctness Accuracy of up to 100%.
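The reward-modeling stage learns from human preferences between feedback candidates. A common way to train such a reward model, assumed here and not confirmed by the source, is the Bradley-Terry pairwise loss, which pushes the score of the preferred response above the score of the rejected one. The function name and the scalar inputs below are hypothetical and chosen only for illustration.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Computes -log(sigmoid(reward_chosen - reward_rejected)), i.e. the loss is
    small when the preferred feedback scores well above the rejected one.
    This is an assumed formulation, not necessarily the one used in PERSA.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == softplus(-m); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores for two feedback drafts on the same submission.
loss_agree = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_disagree = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

When the reward model already ranks the pair correctly (`loss_agree`), the loss is smaller than when it ranks the pair the wrong way round (`loss_disagree`); equal scores give a loss of `log(2)`.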
📝 Abstract
Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM's style with a specific instructor's tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation so that it aligns with a target instructor's style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning: it updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness. For example, on APPS it boosts the Style Alignment Score (SAC) to 96.2% (from 34.8% for Base), with Correctness Accuracy (CA) of up to 100%, on Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone and structure).
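To make the parameter-efficient step concrete, the sketch below picks which parameters would stay trainable when only the feed-forward projections of the top transformer blocks are updated. It is an illustration under stated assumptions, not the authors' implementation: the `layers.<i>.` and `mlp` naming follows the convention used by Hugging Face Llama checkpoints, `top_k` is a hypothetical hyperparameter, and the sketch takes the narrower reading of the abstract in which attention weights in the top blocks also stay frozen.

```python
import re


def select_trainable(param_names, num_layers, top_k, ffn_markers=("mlp",)):
    """Return the names of parameters to leave trainable.

    Keeps only feed-forward parameters that belong to the last `top_k`
    of `num_layers` transformer blocks; everything else is treated as frozen.
    """
    first_trainable = num_layers - top_k
    layer_pattern = re.compile(r"layers\.(\d+)\.")
    selected = []
    for name in param_names:
        match = layer_pattern.search(name)
        if match is None:
            continue  # embeddings, final norm, and LM head stay frozen
        layer_index = int(match.group(1))
        in_top_block = layer_index >= first_trainable
        is_feed_forward = any(marker in name for marker in ffn_markers)
        if in_top_block and is_feed_forward:
            selected.append(name)
    return selected


# Toy 4-block model with hypothetical parameter names (assumed naming scheme).
demo_names = [
    f"model.layers.{i}.{sub}.weight"
    for i in range(4)
    for sub in ("self_attn.q_proj", "mlp.up_proj", "mlp.down_proj")
] + ["model.embed_tokens.weight", "lm_head.weight"]

demo_trainable = select_trainable(demo_names, num_layers=4, top_k=2)
```

With a real PyTorch model, the returned names could be used to set `requires_grad` to `True` on the selected parameters and `False` on all others before running supervised fine-tuning and PPO.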
Problem

Research questions and friction points this paper is trying to address.

personalized feedback
style alignment
large language models
educational AI
instructor tone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning from Human Feedback (RLHF)
Parameter-Efficient Fine-Tuning
Style Alignment
Personalized Educational Feedback
Transformer Internals
Ravi Ranjan
Florida International University (FIU), Miami, USA
Utkarsh Grover
University of South Florida (USF), Tampa, USA
Xiaomin Lin
Assistant Prof, University of South Florida
AI for good, Robotics for science, Robotics for good
Agoritsa Polyzou
Florida International University (FIU), Miami, USA