SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Interdisciplinary STEM education requires pedagogically intelligent systems capable of guiding students toward knowledge integration and transfer, yet no reliable benchmark exists to evaluate large language models' (LLMs) instructional capabilities in this domain. Method: We introduce SID, the first benchmark for assessing LLMs' Socratic dialogue-based teaching competence in interdisciplinary STEM contexts. It comprises 10,000 dialogue turns across 48 cross-disciplinary projects spanning physics, biology, engineering, and related fields. We propose a fine-grained annotation scheme grounded in educational measurement theory and employ a hybrid human-automated evaluation methodology with a new suite of metrics (e.g., X-SRG). Contribution/Results: SID quantifies model performance along three core pedagogical dimensions: conceptual linking, cognitive scaffolding, and transfer elicitation. Baseline experiments reveal that state-of-the-art LLMs exhibit significant deficiencies in these higher-order instructional tasks, underscoring SID's critical role in advancing pedagogically grounded AI for STEM education.

📝 Abstract
Fostering students' abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieving this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art LLMs struggle to execute effective guided dialogues that lead students to achieve knowledge integration and transfer. This highlights the critical value of our benchmark in driving the development of more pedagogically aware LLMs.
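
To make the dataset description concrete, here is a minimal sketch of what one annotated dialogue turn might look like. The paper's actual schema is not published in this summary, so every field name below (project_id, conceptual_linking, and so on) is an illustrative assumption, not SID's real format.

```python
from dataclasses import dataclass, field

# Hypothetical record for one SID dialogue turn. Field names and types are
# illustrative assumptions only; the benchmark's published schema may differ.
@dataclass
class DialogueTurn:
    project_id: str            # one of the 48 interdisciplinary STEM projects
    turn_index: int            # position within the multi-turn dialogue
    speaker: str               # "teacher" (the LLM) or "student"
    utterance: str             # the turn's text
    disciplines: list[str] = field(default_factory=list)  # e.g. ["physics", "biology"]
    # Pedagogical annotations along the dimensions named in the AI summary:
    conceptual_linking: int = 0     # links concepts across disciplines?
    cognitive_scaffolding: int = 0  # scaffolds the student's reasoning?
    transfer_elicitation: int = 0   # prompts knowledge transfer?
```

A turn-level schema like this also makes explicit how the three pedagogical dimensions from the summary could be attached to individual turns rather than to whole dialogues.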
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' guided instruction in multi-turn, interdisciplinary STEM dialogues
No effective benchmark exists for assessing higher-order guidance capabilities
State-of-the-art LLMs struggle to guide students toward knowledge integration and transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SID, the first benchmark for evaluating LLMs' Socratic guided instruction
Large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects
Novel annotation schema for capturing deep pedagogical features
New suite of evaluation metrics (e.g., X-SRG); an illustrative scoring sketch follows this list
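
The paper's metric definitions (including X-SRG) are not given in this summary, so the snippet below is only a hedged sketch of how hybrid human/automated per-turn ratings along the summary's three pedagogical dimensions could be aggregated into a per-model report. The uniform averaging and the rating structure are assumptions, not the benchmark's actual computation.

```python
from statistics import mean

# The three pedagogical dimensions named in the AI summary.
DIMENSIONS = ("conceptual_linking", "cognitive_scaffolding", "transfer_elicitation")

def aggregate_scores(turn_ratings: list[dict[str, float]]) -> dict[str, float]:
    """Average per-turn ratings (human or automated) into per-dimension scores
    for one model. A hypothetical aggregation, not the paper's metric."""
    report = {dim: mean(t[dim] for t in turn_ratings) for dim in DIMENSIONS}
    report["overall"] = mean(report[dim] for dim in DIMENSIONS)  # unweighted mean
    return report

# Example: three rated turns from one model's dialogues, scores in [0, 1].
ratings = [
    {"conceptual_linking": 0.6, "cognitive_scaffolding": 0.8, "transfer_elicitation": 0.3},
    {"conceptual_linking": 0.4, "cognitive_scaffolding": 0.7, "transfer_elicitation": 0.2},
    {"conceptual_linking": 0.5, "cognitive_scaffolding": 0.9, "transfer_elicitation": 0.4},
]
print(aggregate_scores(ratings))
```

Running it prints per-dimension means plus an unweighted overall score; the real benchmark may weight dimensions differently or score at the dialogue level rather than per turn.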
Mei Jiang
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Houping Yue
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Bingdong Li
East China Normal University
Hao Hao
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Ying Qian
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Bo Jiang
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Aimin Zhou
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China