Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

📅 2025-05-16
🏛️ arXiv.org
📈 Citations: 5
✨ Influential: 0
🤖 AI Summary
Existing STEM education PBL (Project-Based Learning) evaluation benchmarks lack free-form output structures and rigorous expert validation; moreover, reliable automated solutions that let teachers leverage MLLMs for multimodal, long-context, knowledge-intensive pedagogical tasks remain scarce. Method: We introduce PBLBench, the first expert-validated, multimodal assessment benchmark designed specifically for STEM PBL. It integrates Analytic Hierarchy Process (AHP)-based criterion weighting with a complex-reasoning evaluation paradigm grounded in authentic teacher workflows, emphasizing hallucination resistance and system stability. Contribution/Results: The benchmark empirically evaluates 15 mainstream MLLMs/LLMs; the highest rank accuracy achieved is only 59%, exposing critical limitations in current models' educational complex-reasoning capabilities. PBLBench establishes a reproducible, verifiable evaluation infrastructure for AI teaching assistants.

📝 Abstract
Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
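
The abstract describes deriving structured, weighted evaluation criteria from expert pairwise comparisons via AHP. The sketch below shows how such weights are conventionally computed; the criteria names and the judgement matrix are illustrative assumptions, not taken from the paper. Weights are the normalized principal eigenvector of the reciprocal comparison matrix, and a consistency ratio (CR) below 0.1 is the standard threshold for acceptable expert judgements.

```python
# Minimal AHP sketch: derive criterion weights from an expert
# pairwise-comparison matrix and check judgement consistency.
# The matrix and criteria below are illustrative, not from the paper.
import numpy as np

# Reciprocal pairwise-comparison matrix over 4 hypothetical PBL criteria
# (e.g. rigor, creativity, feasibility, presentation). A[i][j] encodes how
# much more important criterion i is than criterion j on Saaty's 1-9 scale.
A = np.array([
    [1.0, 3.0, 5.0, 7.0],
    [1/3, 1.0, 3.0, 5.0],
    [1/5, 1/3, 1.0, 3.0],
    [1/7, 1/5, 1/3, 1.0],
])

# Weights = principal eigenvector of A, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A)
k = int(np.argmax(eigvals.real))
weights = eigvecs[:, k].real
weights /= weights.sum()

# Consistency check: CI = (lambda_max - n) / (n - 1), CR = CI / RI,
# where RI is Saaty's random index for matrices of size n.
n = A.shape[0]
lambda_max = eigvals[k].real
ci = (lambda_max - n) / (n - 1)
ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]
cr = ci / ri
print("weights:", np.round(weights, 3), "CR:", round(cr, 3))  # CR < 0.1 is acceptable
```

Project submissions would then be scored per criterion and aggregated under these weights to form the ground-truth ranking; the paper's actual criteria hierarchy and aggregation are defined by its human experts.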
Problem

Research questions and friction points this paper is trying to address.

Evaluating STEM education using multimodal data in project-based learning
Addressing unreliable MLLM outputs due to hallucination and instability
Developing rigorous benchmarks for complex educational reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces PBLBench benchmark for STEM education evaluation
Uses Analytic Hierarchy Process for expert-driven criteria weighting
Assesses 15 MLLMs/LLMs on complex reasoning and long-context tasks, ranking them against expert ground truth (one plausible rank-accuracy metric is sketched below)
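
The headline result is a rank accuracy of only 59%, i.e., how often a model orders items the same way the expert ground truth does. Since the page does not spell out the exact formula, the following is a hedged sketch using pairwise ordering agreement; the `expert` and `model` scores are invented purely for illustration.

```python
# Hedged sketch of a pairwise "rank accuracy" metric: the fraction of
# item pairs the model orders the same way as the expert ground truth.
# The paper's actual metric definition may differ.
from itertools import combinations

def rank_accuracy(expert_scores, model_scores):
    """Fraction of pairs (i, j) on which the model's ordering
    agrees with the expert ordering (ties count as disagreement)."""
    pairs = list(combinations(range(len(expert_scores)), 2))
    agree = sum(
        1 for i, j in pairs
        if (expert_scores[i] - expert_scores[j])
         * (model_scores[i] - model_scores[j]) > 0
    )
    return agree / len(pairs)

# Illustrative numbers only: expert and model scores for 5 student projects.
expert = [0.92, 0.75, 0.64, 0.55, 0.41]
model  = [0.88, 0.61, 0.70, 0.52, 0.39]
print(rank_accuracy(expert, model))  # 0.9: one of the ten pairs is swapped
```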
Authors
Xinyi Wu (Shanghai Jiao Tong University, Shanghai, China)
Yanhao Jia (Nanyang Technological University, Singapore)
Qinglin Zhang (Shanghai Jiao Tong University, Shanghai, China)
Yiran Qin (Shanghai AI Laboratory, Shanghai, China)
Luwei Xiao (Nanyang Technological University, Singapore)
Shuai Zhao (Nanyang Technological University, Singapore)