🤖 AI Summary
Problem: The research community still lacks a robust, general-purpose benchmark for evaluating AI models in educational contexts.
Method: This study introduces the first teaching-practice-oriented blind-evaluation arena for AI models. In it, 189 educators role-play realistic learning scenarios, interacting with two models sequentially over multiple turns, and 206 pedagogy experts then judge, blind to model identity, which model better supported the learner's goals. The arena compares leading models head-to-head: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. The framework moves beyond static benchmarks by emphasizing attainment of learning objectives and adherence to principles of good pedagogy.
Contribution/Results: The proposed paradigm enables fine-grained, context-sensitive assessment grounded in educational theory. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of its match-ups, the highest rate among the evaluated models, and it performed markedly better than Claude 3.7 Sonnet, GPT-4o, and o3 on core pedagogical dimensions.
📝 Abstract
Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" where educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models. In particular, $N = 189$ educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which $N = 206$ experts judged which model better supported the user's learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of these match-ups, ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance across key principles of good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading model for learning.
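
To make the headline metric concrete, the following is a minimal sketch of how a win rate "excluding ties" could be computed from pairwise expert judgments. It is not the authors' analysis code: the record layout, the function name `win_rate_excluding_ties`, and the placeholder models `"A"`, `"B"`, `"C"` with their outcomes are all hypothetical and chosen only for illustration.

```python
def win_rate_excluding_ties(judgments, model):
    """Fraction of decided (non-tie) match-ups involving `model` that it won.

    `judgments` is a list of (model_a, model_b, winner) tuples, where
    `winner` is one of the two model names, or None for a tie.
    This record layout is an assumption made for this sketch.
    """
    wins = 0
    decided = 0
    for model_a, model_b, winner in judgments:
        if model not in (model_a, model_b):
            continue  # match-up does not involve this model
        if winner is None:
            continue  # ties are excluded from the denominator
        decided += 1
        if winner == model:
            wins += 1
    if decided == 0:
        raise ValueError(f"no decided match-ups for {model!r}")
    return wins / decided


# Hypothetical judgments between placeholder models (not data from the study).
judgments = [
    ("A", "B", "A"),
    ("A", "C", None),  # tie: ignored
    ("B", "A", "A"),
    ("A", "C", "C"),
    ("B", "C", "B"),
]

rate_a = win_rate_excluding_ties(judgments, "A")
print(f"A wins {rate_a:.1%} of its decided match-ups")
```

Dropping ties from the denominator means the figure answers "when experts expressed a preference, how often did they prefer this model?", which is why it can be higher than the share of all match-ups the model won.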