AI Summary
Current large language models struggle to authentically simulate the trial-and-error processes, iterative refinement, and cognitive fluctuations characteristic of novice learners in open-ended programming tasks. This work proposes a neuro-symbolic framework grounded in self-regulated learning theory, which models behavioral sequences via a semi-Markov process, incorporates explicit knowledge gaps through Bayesian knowledge tracing, and employs a decoupled agent architecture that separates policy reasoning from code generation to preserve plausible learning errors. Evaluated on Python programming tasks, the approach significantly outperforms existing baselines. In a human Turing test (N = 71), participants distinguished model-generated trajectories from real student data at chance level (accuracy: 52.8%, d′ = 0.15), indicating statistical indistinguishability.
Abstract
Simulating student learning behaviors in open-ended problem-solving environments holds potential for education research, from training adaptive tutoring systems to stress-testing pedagogical interventions. However, collecting authentic data is challenging due to privacy concerns and the high cost of longitudinal studies. While Large Language Models (LLMs) offer a promising path to student simulation, they suffer from competency bias, optimizing for efficient correctness rather than the erratic, iterative struggle characteristic of novice learners. We present BEAGLE, a neuro-symbolic framework that addresses this bias by incorporating Self-Regulated Learning (SRL) theory into a novel architecture. BEAGLE integrates three key technical innovations: (1) a semi-Markov model that governs the timing and transitions of cognitive and metacognitive behaviors; (2) Bayesian Knowledge Tracing with explicit flaw injection to enforce realistic knowledge gaps and "unknown unknowns"; and (3) a decoupled agent design that separates high-level strategy use from code generation actions to prevent the model from silently correcting its own intentional errors. In evaluations on Python programming tasks, BEAGLE significantly outperforms state-of-the-art baselines in reproducing authentic trajectories. In a human Turing test, participants could not reliably tell BEAGLE traces apart from real student data: classification accuracy was statistically equivalent to chance (52.8%, d′ = 0.15, N = 71).
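To make the first two components concrete, the following is a minimal sketch of a semi-Markov behavior sampler and a textbook Bayesian Knowledge Tracing update. It is not BEAGLE's implementation: the state names, transition probabilities, gamma dwell-time parameters, and the function names `simulate_behaviors` and `bkt_update` are all assumptions chosen for illustration, and flaw injection is not modeled here.

```python
import random

# Hypothetical behavior states and transition probabilities (illustrative only).
# A semi-Markov chain has no self-transitions: time spent in a state is
# modeled by an explicit dwell-time distribution instead.
TRANSITIONS = {
    "plan":  {"code": 0.7, "test": 0.1, "debug": 0.2},
    "code":  {"plan": 0.1, "test": 0.7, "debug": 0.2},
    "test":  {"plan": 0.2, "code": 0.3, "debug": 0.5},
    "debug": {"plan": 0.1, "code": 0.5, "test": 0.4},
}

# Assumed gamma (shape, scale) parameters for dwell time in seconds.
DWELL = {
    "plan":  (2.0, 15.0),
    "code":  (2.0, 40.0),
    "test":  (2.0, 5.0),
    "debug": (2.0, 30.0),
}


def simulate_behaviors(n_steps, start="plan", seed=None):
    """Return a list of (state, duration_seconds) pairs."""
    rng = random.Random(seed)
    state, trace = start, []
    for _ in range(n_steps):
        shape, scale = DWELL[state]
        # Sample how long the learner stays in the current state.
        trace.append((state, rng.gammavariate(shape, scale)))
        # Then sample which state comes next.
        nxt = TRANSITIONS[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return trace


def bkt_update(p_known, correct, slip=0.1, guess=0.2, learn=0.1):
    """Standard BKT: Bayes update on one observation, then a learning step."""
    if correct:
        evidence = p_known * (1 - slip)
        posterior = evidence / (evidence + (1 - p_known) * guess)
    else:
        evidence = p_known * slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - guess))
    # The skill may be acquired after the practice opportunity.
    return posterior + (1 - posterior) * learn


if __name__ == "__main__":
    for state, seconds in simulate_behaviors(5, seed=0):
        print(f"{state:<6}{seconds:7.1f}s")
    print(bkt_update(0.5, correct=False))
```

In a simulator of this kind, the mastery estimate from `bkt_update` could gate which concepts the code-generation agent is allowed to apply correctly, while the sampled behavior trace decides when it plans, codes, tests, or debugs. How BEAGLE actually couples these components is not specified in the abstract.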