From Course to Skill: Evaluating LLM Performance in Curricular Analytics

📅 2025-05-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the reliability of large language models (LLMs) for skill extraction in curricular analytics, focusing on their capacity to process large-scale, unstructured course texts. We benchmark four approaches—retrieval-augmented generation (RAG), zero-shot prompting, TF-IDF matching, and BERT embedding similarity—on a corpus of 400 multi-source course documents. To ensure rigorous evaluation, we introduce the first human-in-the-loop assessment framework for this task. Results show that RAG consistently outperforms zero-shot prompting and traditional NLP methods across all document types; zero-shot prompting exhibits poor generalization; and both model selection and prompt engineering significantly impact extraction quality. This work represents the first systematic application of RAG to curricular analytics and empirically validates the efficacy of human-in-the-loop evaluation. It establishes a reproducible methodological foundation and provides empirical evidence to advance intelligent, data-driven educational analysis.

📝 Abstract
Curricular analytics (CA) -- the systematic analysis of curriculum data to inform program and course refinement -- has become an increasingly valuable tool for helping institutions align academic offerings with evolving societal and economic demands. Large language models (LLMs) are promising for handling large-scale, unstructured curriculum data, but it remains uncertain how reliably LLMs can perform CA tasks. In this paper, we systematically evaluate four text alignment strategies, based on LLMs or traditional NLP methods, for skill extraction, a core task in CA. Using a stratified sample of 400 curriculum documents of different types and a human-LLM collaborative evaluation framework, we find retrieval-augmented generation (RAG) to be the top-performing strategy across all types of curriculum documents, while zero-shot prompting performs worse than traditional NLP methods in most cases. Our findings highlight the promise of LLMs in analyzing brief and abstract curriculum documents, but also reveal that their performance can vary significantly depending on model selection and prompting strategies. This underscores the importance of carefully evaluating the performance of LLM-based strategies before large-scale deployment.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reliability in curricular analytics tasks
Comparing LLM and NLP methods for skill extraction
Assessing performance variability based on model and prompting strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses retrieval-augmented generation (RAG) for skill extraction
Evaluates four text alignment strategies systematically
Employs human-LLM collaborative evaluation framework
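Of the four strategies benchmarked, the TF-IDF matching baseline is the simplest to illustrate: course text and candidate skill labels are projected into a shared term-weight space and ranked by cosine similarity. The sketch below uses scikit-learn with an invented course description and skill list; it is a minimal illustration of the general technique, not the paper's actual pipeline or data.

```python
# Minimal sketch of a TF-IDF matching baseline for skill extraction:
# rank candidate skills by cosine similarity between a course
# description and skill labels. Course text and skills are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

skills = [
    "statistical inference",
    "machine learning",
    "public speaking",
]
course = ("Introduction to machine learning: supervised models, "
          "evaluation metrics, and statistical foundations.")

# Fit one vocabulary over the course text and all skill labels so the
# resulting vectors live in the same term space.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([course] + skills)

# Similarity of the course (row 0) to each skill (rows 1..n).
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
ranked = sorted(zip(skills, scores), key=lambda kv: -kv[1])
print(ranked[0][0])
```

The same ranking loop generalizes to the BERT-embedding baseline by swapping the TF-IDF vectors for sentence embeddings; the paper's finding is that both are outperformed by RAG on all document types.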
Zhen Xu
Columbia University, New York, NY, USA
Xinjin Li
Columbia University, New York, NY, USA
Yingqi Huan
Columbia University, New York, NY, USA
Veronica Minaya
Columbia University, New York, NY, USA
Renzhe Yu
Assistant Professor, Columbia University
Educational Data Science · Learning Analytics · Computational Social Science · Responsible AI