SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Interdisciplinary STEM education requires pedagogically intelligent systems capable of guiding students toward knowledge integration and transfer, yet no reliable benchmark exists to evaluate large language models' (LLMs) instructional capabilities in this domain. Method: We introduce SID, the first benchmark for assessing LLMs' Socratic dialogue-based teaching competence in interdisciplinary STEM contexts. It comprises 10,000 dialogue turns across 48 cross-disciplinary projects spanning physics, biology, engineering, and related fields. We propose a fine-grained annotation scheme grounded in educational measurement theory and employ a hybrid human-automated evaluation methodology with a new suite of metrics (e.g., X-SRG). Contribution/Results: SID quantifies model performance along three core pedagogical dimensions: conceptual linking, cognitive scaffolding, and transfer elicitation. Baseline experiments reveal that state-of-the-art LLMs exhibit significant deficiencies in these higher-order instructional tasks, underscoring SID's critical role in advancing pedagogically grounded AI for STEM education.

📝 Abstract
Fostering students' abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieving this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to execute effective guided dialogues that lead students to achieve knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically aware LLMs.
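
To make the dataset description concrete, here is a minimal sketch of what one annotated dialogue turn might look like. The paper's actual schema is not published in this summary, so every field name below (project_id, conceptual_linking, and so on) is an illustrative assumption, not SID's real format.

```python
from dataclasses import dataclass, field

# Hypothetical record for one SID dialogue turn. Field names and types are
# illustrative assumptions only; the benchmark's published schema may differ.
@dataclass
class DialogueTurn:
    project_id: str            # one of the 48 interdisciplinary STEM projects
    turn_index: int            # position within the multi-turn dialogue
    speaker: str               # "teacher" (the LLM) or "student"
    utterance: str             # the turn's text
    disciplines: list[str] = field(default_factory=list)  # e.g. ["physics", "biology"]
    # Pedagogical annotations along the dimensions named in the AI summary:
    conceptual_linking: int = 0     # links concepts across disciplines?
    cognitive_scaffolding: int = 0  # scaffolds the student's reasoning?
    transfer_elicitation: int = 0   # prompts knowledge transfer?
```

A turn-level schema like this also makes explicit how the three pedagogical dimensions from the summary could be attached to individual turns rather than to whole dialogues.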
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' guided instruction in multi-turn, interdisciplinary STEM dialogues
No effective benchmark exists for assessing higher-order guidance capabilities
State-of-the-art LLMs struggle to guide students toward knowledge integration and transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SID, the first benchmark for evaluating LLMs' Socratic guided instruction
Large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects
Novel annotation schema for capturing deep pedagogical features
New suite of evaluation metrics (e.g., X-SRG); an illustrative scoring sketch follows this list
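
The paper's metric definitions (including X-SRG) are not given in this summary, so the snippet below is only a hedged sketch of how hybrid human/automated per-turn ratings along the summary's three pedagogical dimensions could be aggregated into a per-model report. The uniform averaging and the rating structure are assumptions, not the benchmark's actual computation.

```python
from statistics import mean

# The three pedagogical dimensions named in the AI summary.
DIMENSIONS = ("conceptual_linking", "cognitive_scaffolding", "transfer_elicitation")

def aggregate_scores(turn_ratings: list[dict[str, float]]) -> dict[str, float]:
    """Average per-turn ratings (human or automated) into per-dimension scores
    for one model. A hypothetical aggregation, not the paper's metric."""
    report = {dim: mean(t[dim] for t in turn_ratings) for dim in DIMENSIONS}
    report["overall"] = mean(report[dim] for dim in DIMENSIONS)  # unweighted mean
    return report

# Example: three rated turns from one model's dialogues, scores in [0, 1].
ratings = [
    {"conceptual_linking": 0.6, "cognitive_scaffolding": 0.8, "transfer_elicitation": 0.3},
    {"conceptual_linking": 0.4, "cognitive_scaffolding": 0.7, "transfer_elicitation": 0.2},
    {"conceptual_linking": 0.5, "cognitive_scaffolding": 0.9, "transfer_elicitation": 0.4},
]
print(aggregate_scores(ratings))
```

Running it prints per-dimension means plus an unweighted overall score; the real benchmark may weight dimensions differently or score at the dialogue level rather than per turn.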
Mei Jiang
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Houping Yue
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Bingdong Li
East China Normal University
Hao Hao
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Ying Qian
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Bo Jiang
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Aimin Zhou
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China