🤖 AI Summary
This study investigates whether large language models (LLMs) can accurately predict the cognitive difficulty of educational items for human learners and align with human-perceived difficulty. Method: We develop a cross-domain, multi-model evaluation framework that combines explicit proficiency simulation (prompting models to emulate learners at specified ability levels) with large-scale human difficulty annotations across diverse domains (e.g., medicine, mathematics), and use it to empirically assess over 20 LLMs. Contribution/Results: We report the first empirical evidence that scaling model size exacerbates human–model difficulty misalignment; high performance does not imply high pedagogical empathy. LLMs consistently lack metacognitive awareness of their own cognitive limitations and fail to reliably model learners' ability boundaries. Consequently, current LLMs are not yet reliable enough for automated item difficulty prediction in cold-start educational settings, exposing a fundamental gap between "capability simulation" and "empathic understanding" in AI-driven education.
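To make the evaluation setup concrete, here is a minimal sketch (not the authors' code) of proficiency-simulation prompting and difficulty alignment scoring: the model is asked to role-play a learner at a given ability level and rate item difficulty, and its ratings are rank-correlated with human annotations. The prompt wording and the `query_llm` helper are hypothetical placeholders, not part of the original framework.

```python
# Minimal sketch (assumed, not the authors' implementation):
# proficiency-simulation prompting + rank-correlation alignment score.
from scipy.stats import spearmanr

PROMPT_TEMPLATE = (
    "You are a {level} learner in {domain}. "
    "Rate how difficult the following item would be for you, "
    "from 1 (trivial) to 5 (very hard). Answer with a single number.\n\n{item}"
)

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real model call; replace with your
    # own inference client (API or local).
    raise NotImplementedError

def predicted_difficulty(item: str, domain: str, level: str = "novice") -> float:
    """Ask the model to role-play a learner and rate the item's difficulty."""
    reply = query_llm(PROMPT_TEMPLATE.format(level=level, domain=domain, item=item))
    return float(reply.strip())

def difficulty_alignment(items, human_difficulty, domain, level="novice") -> float:
    """Spearman correlation between model-predicted and human-annotated difficulty."""
    model_difficulty = [predicted_difficulty(it, domain, level) for it in items]
    rho, _ = spearmanr(model_difficulty, human_difficulty)
    return rho
```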
📝 Abstract
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold-start problem. While large language models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment in which scaling up model size does not reliably help; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation: models struggle to simulate the capability limitations of students even when explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
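The "machine consensus" finding can be illustrated with a simple contrast between inter-model agreement and model–human agreement over the same item set. The sketch below uses toy numbers and assumed helper names; it is illustrative only, not the paper's analysis code.

```python
# Illustrative sketch (assumed, not the paper's analysis):
# compare inter-model difficulty agreement with model-human agreement.
from itertools import combinations
from scipy.stats import spearmanr

def mean_pairwise_agreement(model_scores: dict[str, list[float]]) -> float:
    """Average Spearman correlation between every pair of models."""
    pairs = list(combinations(model_scores.values(), 2))
    return sum(spearmanr(a, b)[0] for a, b in pairs) / len(pairs)

def mean_human_agreement(model_scores: dict[str, list[float]], human: list[float]) -> float:
    """Average Spearman correlation between each model and human annotations."""
    scores = list(model_scores.values())
    return sum(spearmanr(s, human)[0] for s in scores) / len(scores)

# Toy example: three models rank item difficulty similarly to one another
# but quite differently from humans, i.e. machine consensus > human alignment.
scores = {
    "model_a": [1, 2, 3, 4, 5],
    "model_b": [1, 2, 4, 3, 5],
    "model_c": [2, 1, 3, 4, 5],
}
human = [5, 4, 1, 2, 3]
print(mean_pairwise_agreement(scores))      # high inter-model agreement
print(mean_human_agreement(scores, human))  # low model-human agreement
```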