🤖 AI Summary
Interdisciplinary STEM education requires pedagogically intelligent systems capable of guiding students toward knowledge integration and transfer, yet no reliable benchmark exists for evaluating large language models’ (LLMs’) instructional capabilities in this domain. Method: We introduce SID, the first benchmark for assessing LLMs’ Socratic dialogue-based teaching competence in interdisciplinary STEM contexts. It comprises 10,000 dialogue turns across 48 cross-disciplinary projects spanning physics, biology, engineering, and related fields. We propose a fine-grained annotation scheme grounded in educational measurement theory and employ a hybrid human–automated evaluation methodology. Contribution/Results: A new suite of metrics, including X-SRG, quantifies model performance along three core pedagogical dimensions: conceptual linking, cognitive scaffolding, and transfer elicitation. Baseline experiments reveal that state-of-the-art LLMs exhibit significant deficiencies in these higher-order instructional tasks, underscoring SID’s critical role in advancing pedagogically grounded AI for STEM education.
📝 Abstract
Fostering students' ability to integrate and transfer knowledge in complex problem-solving scenarios is a core objective of modern education. Interdisciplinary STEM is a key pathway to this goal, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to conduct effective guided dialogues that lead students to knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically aware LLMs.