TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students

📅 2025-05-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM evaluation frameworks overemphasize final-solution correctness while neglecting the pedagogical interaction capabilities, such as tutoring and learning behaviors, required in Intelligent Tutoring Systems (ITSs). Method: TutorGym is a standardized bidirectional integration platform that lets AI agents (e.g., LLMs) interoperate with mature ITS platforms (e.g., CTAT, OATutors), supporting concurrent evaluation of both tutor roles (prompting, feedback generation, example provision) and learner roles (responding to guidance, generating learning trajectories). Contribution/Results: TutorGym introduces two evaluation mechanisms: alignment with human learning curves and step-level behavioral trajectory analysis. Experiments show that current LLMs are poor tutors (next-step actions correct only ~52-70% of the time, with error detection no better than chance), yet as learners their learning curves closely mirror human patterns, demonstrating the platform's sensitivity and validity for pedagogically grounded LLM assessment.

📝 Abstract
Recent improvements in large language model (LLM) performance on academic benchmarks, such as MATH and GSM8K, have emboldened their use as standalone tutors and as simulations of human learning. However, these new applications require more than evaluations of final solution generation. We introduce TutorGym to evaluate these applications more directly. TutorGym is a standard interface for testing artificial intelligence (AI) agents within existing intelligent tutoring systems (ITS) that have been tested and refined in classroom studies, including Cognitive Tutors (CTAT), Apprentice Tutors, and OATutors. TutorGym is more than a simple problem-solution benchmark; it situates AI agents within the interactive interfaces of existing ITSs. At each step of problem-solving, AI agents are asked what they would do as a tutor or as a learner. As tutors, AI agents are prompted to provide tutoring support -- such as generating examples, hints, and step-level correctness feedback -- which can be evaluated directly against the adaptive step-by-step support provided by existing ITSs. As students, agents directly learn from ITS instruction, and their mistakes and learning trajectories can be compared to student data. TutorGym establishes a common framework for training and evaluating diverse AI agents, including LLMs, computational models of learning, and reinforcement learning agents, within a growing suite of learning environments. Currently, TutorGym includes 223 different tutor domains. In an initial evaluation, we find that current LLMs are poor at tutoring -- none did better than chance at labeling incorrect actions, and next-step actions were correct only ~52-70% of the time -- but they could produce remarkably human-like learning curves when trained as students with in-context learning.
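The step-level tutor evaluation the abstract describes can be sketched as a simple loop: at each step of a problem, the agent is asked for the next action and for a correctness label on a candidate student action, and both are scored against the ITS's expert model. This is a minimal illustrative sketch, not TutorGym's actual API; all names (`evaluate_tutor`, `RandomAgent`, the problem format) are hypothetical stand-ins.

```python
import random

def its_expected_steps(problem):
    # Hypothetical stand-in for an ITS expert model: the correct step sequence.
    return problem["steps"]

def evaluate_tutor(agent, problems):
    """At each step, query the agent for (a) the next action and
    (b) a correctness label on a sampled student action."""
    next_step_hits = label_hits = total = 0
    for problem in problems:
        steps = its_expected_steps(problem)
        for i, expected in enumerate(steps):
            history = steps[:i]
            # (a) next-step action accuracy, judged against the ITS's expected step
            if agent.next_step(problem, history) == expected:
                next_step_hits += 1
            # (b) step-level correctness feedback: show a correct or incorrect
            # student action and check the agent's label against ground truth
            action, is_correct = ((expected, True) if random.random() < 0.5
                                  else ("wrong-action", False))
            if agent.label_correct(problem, history, action) == is_correct:
                label_hits += 1
            total += 1
    return next_step_hits / total, label_hits / total

class RandomAgent:
    """Chance baseline, against which tutoring agents can be compared."""
    def next_step(self, problem, history):
        return random.choice(["x=2", "x=3", "wrong-action"])
    def label_correct(self, problem, history, action):
        return random.random() < 0.5
```

An LLM-backed agent would replace `RandomAgent` with one that prompts the model using the problem and step history; the paper's finding is that such agents do no better than this chance baseline at labeling incorrect actions.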
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents as tutors and students in interactive learning environments
Assessing AI-generated tutoring support against existing intelligent tutoring systems
Comparing AI learning trajectories with human student data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standard interface for AI in tutoring systems
Evaluates AI as tutors and students interactively
Supports diverse AI agents in learning environments