🤖 AI Summary
This study addresses the lack of datasets that jointly evaluate large language models’ (LLMs’) mathematical competence, biases, and psychological dimensions, such as math anxiety and self-efficacy, in mathematics education. To bridge this gap, the authors introduce the MEDS (Math Education Digital Shadows) dataset, comprising 28,000 persona-based “digital shadows” generated by 14 LLMs from families such as Mistral, Qwen, and DeepSeek, each shadowing either a human or an AI assistant while working on mathematical tasks. By combining an open-ended math interview, psychometric tests, cognitive network modeling, and reasoning analysis on high-school math questions, the dataset moves beyond traditional score-only evaluation. Data validation shows that the sampled LLMs maintain consistent personas and reveals family-specific biases, including human-like negative math attitudes, logical fallacies, and overconfidence, providing a data foundation and evaluation framework for developing safer AI tutors in mathematics.
📝 Abstract
To enhance LLMs' impact on math education, we need data on their mathematical prowess and biases across prompts. To fill this gap, we introduce MEDS (Math Education Digital Shadows) as a dataset mapping how large language models reason about and report mathematics across human- and AI-like conditions. MEDS involves 28,000 personas from 14 LLMs (from families like Mistral, Qwen, DeepSeek, Granite, Phi and Grok) shadowing either humans or AI assistants. Each record/shadow includes a set of prompts along with psychological/sociodemographic persona metadata and four types of math tasks: (i) open math interview, (ii) three psychometric tests about math perceptions with explanations, (iii) cognitive networks capturing math attitudes, and (iv) 18 high-school math test questions together with their reasoning and confidence scores. MEDS differs from traditional score-only math benchmarks because it integrates concepts of self-efficacy, math anxiety, and cognitive network science besides math proficiency scores. Data validation shows that the sampled LLMs exhibit schema integrity and consistent personas, together with family-specific peculiarities like human-like negative math attitudes, logical fallacies, and math overconfidence. MEDS will benefit learning analytics experts, cognitive scientists, and developers of safer AI tutors in mathematics.
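As a rough illustration of the record structure described in the abstract, the sketch below models one shadow as a Python dataclass and runs a minimal schema check. All class, field, and function names here are hypothetical, as is the edge-list encoding of the cognitive network; the abstract specifies only the components (prompts, persona metadata, an interview, three psychometric tests, a cognitive network, and 18 test questions with reasoning and confidence), not how MEDS actually stores them.

```python
from dataclasses import dataclass


@dataclass
class ItemAnswer:
    """One high-school math test item answered by a shadow (names assumed)."""
    question_id: int
    answer: str
    reasoning: str
    confidence: float  # scale is not stated in the abstract; left unconstrained


@dataclass
class ShadowRecord:
    """Hypothetical layout of one MEDS record; every field name is an assumption."""
    model: str                                 # which of the 14 LLMs produced it
    condition: str                             # "human" or "ai_assistant" shadow
    persona: dict[str, str]                    # psychological/sociodemographic metadata
    prompts: list[str]
    interview: str                             # (i) open math interview
    psychometric_tests: list[dict]             # (ii) three tests with explanations
    cognitive_network: list[tuple[str, str]]   # (iii) assumed edge-list encoding
    test_answers: list[ItemAnswer]             # (iv) 18 questions per the abstract


def schema_problems(record: ShadowRecord) -> list[str]:
    """Return human-readable schema violations; an empty list means none found."""
    problems = []
    if record.condition not in ("human", "ai_assistant"):
        problems.append(f"unknown condition: {record.condition!r}")
    if len(record.psychometric_tests) != 3:
        problems.append(
            f"expected 3 psychometric tests, got {len(record.psychometric_tests)}"
        )
    if len(record.test_answers) != 18:
        problems.append(
            f"expected 18 test answers, got {len(record.test_answers)}"
        )
    return problems
```

A check of this kind is one simple way to read the abstract's "schema integrity" claim: each record either has the expected components in the expected counts or it does not. The authors' actual validation procedure is not described here and may differ.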