Toward LLM-Supported Automated Assessment of Critical Thinking Subskills

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of automating fine-grained assessment of critical thinking subskills in argumentative writing. We propose the first operational, granular coding rubric for these subskills and construct a manually annotated corpus of student essays. Methodologically, we empirically evaluate zero-shot prompting, few-shot prompting, and supervised fine-tuning across three models (GPT-5, GPT-5-mini, and ModernBERT). Our key contributions are threefold: (1) the first scalable, automated measurement framework for multidimensional critical thinking subskills; (2) empirical evidence that GPT-5 with few-shot prompting performs best, particularly on high-frequency, well-separated subskill categories, while performance degrades on subskills requiring subtle distinctions or rare categories, and that open-weight models offer practical accuracy at lower cost with reduced sensitivity to minority categories; and (3) validation of the feasibility of automating higher-order thinking assessment, along with identification of its key bottlenecks. This work establishes a methodological benchmark and offers actionable insights for AI-driven educational assessment.
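
As a rough sketch of what the few-shot prompting condition might look like in practice (the paper's actual prompts, rubric wording, and label set are not reproduced here, so the rubric, demonstrations, and model identifier below are illustrative assumptions), a single essay excerpt could be scored against one subskill with an OpenAI-style chat API:

```python
# Minimal sketch of few-shot prompting for subskill scoring.
# The rubric, labels, demonstrations, and model name are illustrative
# assumptions, not taken from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the 'evaluating evidence' subskill on a 0-2 scale:\n"
    "0 = no evaluation of evidence, 1 = evidence mentioned but not weighed, "
    "2 = evidence explicitly weighed against a claim."
)

# Few-shot demonstrations: (essay excerpt, gold label) pairs from human coding.
FEW_SHOT = [
    ("Some people say homework helps, so it must be good.", "0"),
    ("The survey shows 70% improvement, which supports the policy, "
     "although the sample was small.", "2"),
]

def score_excerpt(excerpt: str, model: str = "gpt-5-mini") -> str:
    """Return the model's predicted rubric label for one essay excerpt."""
    messages = [{"role": "system", "content": RUBRIC}]
    for text, label in FEW_SHOT:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": excerpt})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content.strip()

print(score_excerpt("The author cites one study but never questions its methods."))
```

In a setup like this, reliability would typically be checked by comparing the model's labels against the human-coded corpus, for example with Cohen's kappa per subskill.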

📝 Abstract
Critical thinking represents a fundamental competency in today's education landscape. Developing critical thinking skills through timely assessment and feedback is crucial; however, there has not been extensive work in the learning analytics community on defining, measuring, and supporting critical thinking. In this paper, we investigate the feasibility of measuring core "subskills" that underlie critical thinking. We ground our work in an authentic task where students operationalize critical thinking: student-written argumentative essays. We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays. We then evaluated three distinct approaches to automated scoring: zero-shot prompting, few-shot prompting, and supervised fine-tuning, implemented across three large language models (GPT-5, GPT-5-mini, and ModernBERT). GPT-5 with few-shot prompting achieved the strongest results and demonstrated particular strength on subskills with separable, frequent categories, while lower performance was observed for subskills that required detection of subtle distinctions or rare categories. Our results underscore critical trade-offs in automated critical thinking assessment: proprietary models offer superior reliability at higher cost, while open-source alternatives provide practical accuracy with reduced sensitivity to minority categories. Our work represents an initial step toward scalable assessment of higher-order reasoning skills across authentic educational contexts.
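Similarly, the supervised fine-tuning approach could be sketched as fine-tuning ModernBERT as a per-subskill sequence classifier with Hugging Face Transformers; the checkpoint name, label scale, hyperparameters, and toy data below are assumptions, not the authors' configuration:

```python
# Minimal sketch: fine-tuning ModernBERT for per-subskill classification.
# Checkpoint, hyperparameters, and the toy dataset are illustrative assumptions;
# ModernBERT support requires a recent transformers release.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import Dataset

MODEL_NAME = "answerdotai/ModernBERT-base"  # assumed checkpoint
NUM_LABELS = 3                              # e.g. a 0-2 rubric scale

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)

# Toy stand-in for the human-coded essay corpus: excerpt text + integer label.
train = Dataset.from_dict({
    "text": ["Evidence is weighed against the counterclaim.",
             "The essay restates the prompt without analysis."],
    "label": [2, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length",
                     max_length=512)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-subskill",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()
```

One classifier head per subskill (or a multi-task head) would be a natural design choice here, since the paper treats each subskill as its own coding dimension.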
Problem

Research questions and friction points this paper is trying to address.

Automating assessment of critical thinking subskills in education
Evaluating LLM approaches for scoring argumentative student essays
Addressing trade-offs between reliability and cost in automated evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated scoring using large language models
Few-shot prompting for critical thinking assessment
Evaluating proprietary versus open-source model trade-offs
Marisa C. Peczuh
University of Minnesota
Nischal Ashok Kumar
University of Massachusetts Amherst
Ryan Baker
Adelaide University
Educational Data Mining, Learning Analytics, Learning Engineering, Quantitative Ethnography
Blair Lehman
Brighter Research
emotion, motivation, learning, assessment
Danielle Eisenberg
Educational Testing Service (ETS)
Caitlin Mills
Associate Professor, University of Minnesota
mind wandering, boredom, affect, engagement
Keerthi Chebrolu
University of Massachusetts Amherst
Sudhip Nashi
University of Massachusetts Amherst
Cadence Young
University of Massachusetts Amherst
Brayden Liu
University of Massachusetts Amherst
Sherry Lachman
Advanced Education Research and Development Fund (AERDF)
Andrew Lan
University of Massachusetts Amherst