Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

📅 2025-05-16
🏛️ arXiv.org
📈 Citations: 5
✨ Influential: 0
🤖 AI Summary
Existing STEM education PBL (Project-Based Learning) evaluation benchmarks lack free-form output structures and rigorous expert validation; moreover, reliable automated solutions that let teachers leverage MLLMs for multimodal, long-context, knowledge-intensive pedagogical tasks remain scarce. Method: We introduce PBLBench, the first expert-validated, multimodal assessment benchmark designed specifically for STEM PBL. It integrates Analytic Hierarchy Process (AHP)-based criterion weighting with a complex-reasoning evaluation paradigm grounded in authentic teacher workflows, emphasizing hallucination resistance and system stability. Contribution/Results: The benchmark empirically evaluates 15 mainstream MLLMs/LLMs; the highest rank accuracy achieved is only 59%, exposing critical limitations in current models' educational complex-reasoning capabilities. PBLBench establishes a reproducible, verifiable evaluation infrastructure for AI teaching assistants.

📝 Abstract
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
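
The abstract describes deriving structured, weighted evaluation criteria from expert pairwise comparisons via AHP. The sketch below shows how such weights are conventionally computed; the criteria names and the judgement matrix are illustrative assumptions, not taken from the paper. Weights are the normalized principal eigenvector of the reciprocal comparison matrix, and a consistency ratio (CR) below 0.1 is the standard threshold for acceptable expert judgements.

```python
# Minimal AHP sketch: derive criterion weights from an expert
# pairwise-comparison matrix and check judgement consistency.
# The matrix and criteria below are illustrative, not from the paper.
import numpy as np

# Reciprocal pairwise-comparison matrix over 4 hypothetical PBL criteria
# (e.g. rigor, creativity, feasibility, presentation). A[i][j] encodes how
# much more important criterion i is than criterion j on Saaty's 1-9 scale.
A = np.array([
    [1.0, 3.0, 5.0, 7.0],
    [1/3, 1.0, 3.0, 5.0],
    [1/5, 1/3, 1.0, 3.0],
    [1/7, 1/5, 1/3, 1.0],
])

# Weights = principal eigenvector of A, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A)
k = int(np.argmax(eigvals.real))
weights = eigvecs[:, k].real
weights /= weights.sum()

# Consistency check: CI = (lambda_max - n) / (n - 1), CR = CI / RI,
# where RI is Saaty's random index for matrices of size n.
n = A.shape[0]
lambda_max = eigvals[k].real
ci = (lambda_max - n) / (n - 1)
ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]
cr = ci / ri
print("weights:", np.round(weights, 3), "CR:", round(cr, 3))  # CR < 0.1 is acceptable
```

Project submissions would then be scored per criterion and aggregated under these weights to form the ground-truth ranking; the paper's actual criteria hierarchy and aggregation are defined by its human experts.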
Problem

Research questions and friction points this paper is trying to address.

Evaluating STEM education using multimodal data in project-based learning
Addressing unreliable MLLM outputs due to hallucination and instability
Developing rigorous benchmarks for complex educational reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces PBLBench benchmark for STEM education evaluation
Uses Analytic Hierarchy Process for expert-driven criteria weighting
Assesses 15 MLLMs/LLMs on complex reasoning and long-context tasks, ranking them against expert ground truth (one plausible rank-accuracy metric is sketched below)
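
The headline result is a rank accuracy of only 59%, i.e., how often a model orders items the same way the expert ground truth does. Since the page does not spell out the exact formula, the following is a hedged sketch using pairwise ordering agreement; the `expert` and `model` scores are invented purely for illustration.

```python
# Hedged sketch of a pairwise "rank accuracy" metric: the fraction of
# item pairs the model orders the same way as the expert ground truth.
# The paper's actual metric definition may differ.
from itertools import combinations

def rank_accuracy(expert_scores, model_scores):
    """Fraction of pairs (i, j) on which the model's ordering
    agrees with the expert ordering (ties count as disagreement)."""
    pairs = list(combinations(range(len(expert_scores)), 2))
    agree = sum(
        1 for i, j in pairs
        if (expert_scores[i] - expert_scores[j])
         * (model_scores[i] - model_scores[j]) > 0
    )
    return agree / len(pairs)

# Illustrative numbers only: expert and model scores for 5 student projects.
expert = [0.92, 0.75, 0.64, 0.55, 0.41]
model  = [0.88, 0.61, 0.70, 0.52, 0.39]
print(rank_accuracy(expert, model))  # 0.9: one of the ten pairs is swapped
```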
Authors
Xinyi Wu (Shanghai Jiao Tong University, Shanghai, China)
Yanhao Jia (Nanyang Technological University, Singapore)
Qinglin Zhang (Shanghai Jiao Tong University, Shanghai, China)
Yiran Qin (Shanghai AI Laboratory, Shanghai, China)
Luwei Xiao (Nanyang Technological University, Singapore)
Shuai Zhao (Nanyang Technological University, Singapore)