Bridging Gaps Between Student and Expert Evaluations of AI-Generated Programming Hints

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies a systematic divergence between students and experts in evaluating the quality of AI-generated programming hints: hints rated highly by experts are frequently perceived as unhelpful by students. Method: Through educational data mining and comparative analysis of instructor–student evaluations in Python programming instruction, we identify key factors influencing students' perceived helpfulness, namely task alignment, cognitive load, and contextual adaptability. We propose a "dual-track evaluation framework" that extends expert annotation scales with student-specific signals, including historical feedback and help-seeking behaviors, to enable context-aware hint generation and personalized optimization. Contribution/Results: Empirical evaluation demonstrates that the framework significantly reduces the discrepancy between instructor and student assessments, improving inter-rater consistency by 32.7% on average. It provides a reproducible methodology and technical pipeline for enhancing the pedagogical appropriateness and user acceptability of AI-generated educational feedback.

📝 Abstract
Generative AI has the potential to enhance education by providing personalized feedback to students at scale. Recent work has proposed techniques to improve AI-generated programming hints and has evaluated their performance based on expert-designed rubrics or student ratings. However, it remains unclear how the rubrics used to design these techniques align with students' perceived helpfulness of hints. In this paper, we systematically study the mismatches in perceived hint quality from students' and experts' perspectives based on the deployment of AI-generated hints in a Python programming course. We analyze scenarios with discrepancies between student and expert evaluations, in particular, where experts rated a hint as high-quality while the student found it unhelpful. We identify key reasons for these discrepancies and classify them into categories, such as hints not accounting for the student's main concern or not considering previous help requests. Finally, we propose and discuss preliminary results on potential methods to bridge these gaps, first by extending the expert-designed quality rubric and then by adapting the hint generation process, e.g., incorporating the student's comments or history. These efforts contribute toward scalable, personalized, and pedagogically sound AI-assisted feedback systems, which are particularly important for high-enrollment educational settings.
Problem

Research questions and friction points this paper is trying to address.

Studying mismatches in AI hint quality between students and experts
Identifying reasons for discrepancies in perceived helpfulness of hints
Proposing methods to bridge evaluation gaps for better feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extending expert-designed quality rubrics
Adapting hint generation with student data
Incorporating student comments and history
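The dual-track idea of extending an expert rubric score with student-specific signals could be sketched as follows. This is a minimal illustrative implementation under assumed names and weights (`StudentSignals`, `dual_track_score`, the 0.6 expert weight, and the repeat-request penalty are all hypothetical), not the authors' actual pipeline:

```python
# Hypothetical sketch of a "dual-track" hint evaluation: blend an
# expert rubric score with student-side signals such as historical
# feedback and recent help-seeking behavior. All names, weights,
# and signal choices are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class StudentSignals:
    avg_past_hint_rating: float  # mean of this student's prior hint ratings, in [0, 1]
    repeat_request: bool         # True if the student re-asked for help on the same task


def dual_track_score(expert_rubric_score: float,
                     signals: StudentSignals,
                     expert_weight: float = 0.6) -> float:
    """Combine an expert rubric score (0..1) with student signals (0..1)."""
    student_track = signals.avg_past_hint_rating
    if signals.repeat_request:
        # A repeated help request suggests earlier hints missed the
        # student's main concern, so discount the student track.
        student_track *= 0.5
    return expert_weight * expert_rubric_score + (1 - expert_weight) * student_track
```

With this weighting, a hint an expert rates as perfect but that follows poor past ratings and a repeated request scores well below 1.0, mirroring the expert-high/student-low discrepancy the paper analyzes.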