Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines how well large language models (LLMs) acting as tutoring agents distinguish optimal, valid but suboptimal, and incorrect reasoning steps in propositional logic. The authors benchmark seven LLM feedback agents on 10,836 solution–feedback pairs under three feedback conditions, using ground-truth labels derived from a knowledge graph. Models perform near ceiling on optimal steps but systematically over-reject valid but suboptimal reasoning and over-validate incorrect solutions, and these failures persist across models regardless of solution context. Accurate diagnosis also does not reliably yield pedagogically actionable feedback, indicating a gap between diagnostic judgment and instructional effectiveness. The authors conclude that LLMs are better suited to hybrid architectures in which knowledge-graph-grounded models handle diagnosis and LLMs support open-ended scaffolding and dialogue.

📝 Abstract

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution–feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
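
To make the two failure modes named in the abstract concrete, the following minimal Python sketch shows one way over-rejection and over-validation could be tallied from three-way labels. It is an illustration only, not the authors' evaluation code: the label names, the function `diagnostic_rates`, and the toy data are all assumptions, and the paper's actual verdict schema and metrics may differ.

```python
from collections import Counter

# Hypothetical three-way label scheme; the paper's real schema may differ.
LABELS = ("optimal", "suboptimal", "incorrect")


def diagnostic_rates(gold, predicted):
    """Summarize how a tutor's verdicts line up with ground-truth labels.

    gold:      ground-truth label for each student step
    predicted: the tutor's verdict for the same step
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = Counter(zip(gold, predicted))  # (gold, predicted) -> count
    totals = Counter(gold)                 # gold label -> count

    def share(gold_label, predicted_labels):
        """Fraction of gold_label items given any of predicted_labels."""
        if totals[gold_label] == 0:
            return float("nan")
        hits = sum(pairs[(gold_label, p)] for p in predicted_labels)
        return hits / totals[gold_label]

    return {
        # Optimal steps recognized as optimal.
        "optimal_accuracy": share("optimal", ["optimal"]),
        # Valid but suboptimal steps wrongly judged incorrect.
        "over_rejection": share("suboptimal", ["incorrect"]),
        # Incorrect steps wrongly judged valid (optimal or suboptimal).
        "over_validation": share("incorrect", ["optimal", "suboptimal"]),
    }


# Toy data invented for illustration; these are NOT results from the paper.
gold = ["optimal", "optimal", "suboptimal", "suboptimal", "incorrect", "incorrect"]
pred = ["optimal", "optimal", "incorrect", "suboptimal", "optimal", "incorrect"]
rates = diagnostic_rates(gold, pred)
```

Reporting the three rates separately, rather than a single overall accuracy, keeps a high score on optimal steps from masking errors on the suboptimal and incorrect cases.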

Problem

Research questions and friction points this paper is trying to address.

LLM tutoring

diagnostic precision

intelligent tutoring systems

feedback accuracy

adaptive tutoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM tutoring agents

diagnostic precision

knowledge graph