🤖 AI Summary
This study investigates whether tutoring solely through conversational large language models (LLMs) is sufficient to support learning in mathematical proof construction, and compares its efficacy with that of structured feedback embedded in students' work as a means of fostering knowledge transfer. Using the GPTutor system, the research presents the first empirical comparison in a discrete mathematics course between students who received LLM assistance via conversational question-answering and those who interacted with LLM-generated annotations embedded directly within their proof-writing workspace. By combining manual behavioral coding with automated classification of interaction logs, the study finds that frequent chatbot use—particularly when coupled with a tendency to seek direct answers—is significantly negatively associated with subsequent exam performance. Embedded feedback showed no such detrimental effect, suggesting it better supports durable learning outcomes.
📝 Abstract
We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. It integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, but we did not observe this improvement transfer to exam scores. Usage logs show that students with lower self-efficacy and lower prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled with an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking vs. help-seeking). In models controlling for prior performance and self-efficacy, higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance, whereas proof-review usage showed no detectable independent association. Together, these findings suggest that chatbot-based support alone may not reliably transfer to independent assessment of proof-writing skill, whereas structured feedback anchored in students' own work shows no comparable association with reduced learning.