🤖 AI Summary
This study investigates whether students’ counterargument writing in generative AI contexts reflects critical thinking, and whether large language models (LLMs) can reliably stand in for human raters when assessing such writing. Students wrote rebuttals on popular debate topics; six frontier LLMs and human reviewers then evaluated each text quantitatively and qualitatively across six rubric dimensions on a 5-point Likert scale. Inter-rater agreement was measured with Gwet’s AC2 coefficient as part of a mixed-methods analysis. The findings indicate that the student texts exhibit key components of critical thinking, particularly logical reasoning. With the exception of one model, the LLMs’ assessments generally aligned with human judgments (AC2 = 0.33), offering evidence of LLMs’ reliability and practical potential in structured, rubric-based writing assessment.
📝 Abstract
This intervention study investigates students' use of counterarguments in writing as a vehicle for critical thinking in the context of Generative AI (GenAI). The question matters because GenAI use carries risks of cheating and cognitive offloading. We presented 36 students in a university course with four carefully selected thesis statements, drawn from a set of popular debates, and asked each student to write about any one of them. Using six established rubrics (focus, logic, content, style, correctness and reference), we conducted three human assessments per write-up (two student peer reviews and one by an experienced teacher) on a 5-point Likert scale for the 35 qualifying submissions (n = 35, after one was disqualified for irregularity). Using the same rubrics and guidelines, we also assessed the submissions with six frontier LLMs as judges. Our mixed-method design combined qualitative open-ended feedback per assessment with quantitative methods. The results reveal that (1) the students' self-written counterarguments to AI-generated content contain logic, among other things, which is a key component of critical thinking, and (2) GenAI can be used successfully at scale to assess students' written work against clear rubrics, and these assessments generally align with human assessments, as shown by Gwet's AC2 inter-rater reliability values of 0.33 for all the models except one.
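For readers unfamiliar with the agreement statistic, the following is a minimal Python sketch of how Gwet's AC2 can be computed for several raters scoring items on an ordinal scale. It is an illustration under stated assumptions, not the study's own analysis code: the function name `gwet_ac2`, the choice of linear ordinal weights, and the toy ratings are all hypothetical, and the abstract does not say which weighting scheme or software was used.

```python
def gwet_ac2(ratings, categories):
    """Gwet's AC2 for multiple raters with linear ordinal weights.

    ratings: list of items, each a list of scores (None = missing rating).
    categories: ordered list of possible scores, e.g. [1, 2, 3, 4, 5].
    Linear weights are an assumption made for this sketch.
    """
    q = len(categories)
    index = {c: k for k, c in enumerate(categories)}

    # Linear weights: full credit on the diagonal, partial credit nearby.
    w = [[1 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]

    # counts[i][k] = number of raters who gave item i the k-th category.
    counts = []
    for item in ratings:
        row = [0] * q
        for score in item:
            if score is not None:
                row[index[score]] += 1
        if sum(row) >= 1:
            counts.append(row)
    n = len(counts)

    # Weighted observed agreement, over items rated by at least two raters.
    multi = [row for row in counts if sum(row) >= 2]
    p_a = 0.0
    for row in multi:
        r_i = sum(row)
        for k in range(q):
            r_star = sum(w[k][l] * row[l] for l in range(q))
            p_a += row[k] * (r_star - 1) / (r_i * (r_i - 1))
    p_a /= len(multi)

    # Chance agreement from the average category proportions.
    pi = [sum(row[k] / sum(row) for row in counts) / n for k in range(q)]
    t_w = sum(sum(w_row) for w_row in w)
    p_e = t_w / (q * (q - 1)) * sum(p * (1 - p) for p in pi)

    return (p_a - p_e) / (1 - p_e)


# Hypothetical toy data: three raters scoring four write-ups on a 1-5 scale.
toy = [[4, 4, 5], [3, 4, 3], [5, 5, 5], [2, 3, 2]]
print(round(gwet_ac2(toy, [1, 2, 3, 4, 5]), 3))
```

In a setting like the one described, each inner list would hold the scores that the compared raters (for example, the human assessors and one LLM judge) gave a single submission on one rubric. Identical scores on every item yield a coefficient of 1, up to floating-point rounding.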