Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

📅 2026-05-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study examines how well large language models (LLMs), used as tutoring agents, discriminate between optimal, valid but suboptimal, and incorrect reasoning steps in propositional logic tasks. It reveals a systematic bias: models over-reject suboptimal reasoning and over-validate incorrect reasoning. The authors construct a benchmark of 10,836 problem-solving–feedback pairs, annotated with ground-truth labels derived from a knowledge graph, and evaluate the diagnostic capabilities of seven LLMs under three feedback conditions. Their analysis shows that although the models perform near ceiling on optimal reasoning steps, even accurate diagnoses do not reliably turn into pedagogically actionable feedback, indicating a gap between diagnostic accuracy and instructional effectiveness. To bridge this gap, the authors suggest hybrid architectures in which knowledge-graph-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
๐Ÿ“ Abstract
Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution–feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
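The over-rejection and over-validation pattern described above can be made concrete with a small tally. The following is a minimal sketch, not the authors' evaluation code: it assumes each solution–feedback pair has been reduced to a ground-truth label (`optimal`, `suboptimal`, or `incorrect`) and a binary model verdict (`accept` or `reject`). The function name `diagnostic_rates` and the verdict encoding are hypothetical choices made for illustration.

```python
from collections import Counter

def diagnostic_rates(pairs):
    """Summarize verdicts per ground-truth class.

    `pairs` is an iterable of (ground_truth, verdict) tuples, where
    ground_truth is "optimal", "suboptimal", or "incorrect" and verdict
    is "accept" or "reject". This encoding is an assumption made for
    illustration, not the paper's actual data format.
    """
    pairs = list(pairs)                      # allow generators
    joint = Counter(pairs)                   # counts per (truth, verdict)
    totals = Counter(truth for truth, _ in pairs)

    def rate(truth, verdict):
        # Fraction of steps with this ground truth that got this verdict;
        # None when the class is absent from the data.
        if totals[truth] == 0:
            return None
        return joint[(truth, verdict)] / totals[truth]

    return {
        # Optimal steps the model correctly accepts.
        "optimal_accepted": rate("optimal", "accept"),
        # Valid but suboptimal steps the model wrongly rejects.
        "suboptimal_over_rejected": rate("suboptimal", "reject"),
        # Incorrect steps the model wrongly accepts.
        "incorrect_over_validated": rate("incorrect", "accept"),
    }
```

Under this encoding, near-ceiling performance on optimal steps would show up as `optimal_accepted` close to 1, while the failures the abstract reports would appear as elevated `suboptimal_over_rejected` and `incorrect_over_validated` values.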
Problem

Research questions and friction points this paper is trying to address.

LLM tutoring
diagnostic precision
intelligent tutoring systems
feedback accuracy
adaptive tutoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM tutoring agents
diagnostic precision
knowledge graph
adaptive tutoring
pedagogical feedback