🤖 AI Summary
This study investigates whether large language models (LLMs) can directly interpret and apply human-authored, action-oriented “can-do” descriptors—standardized in second-language assessment—to perform natural language–based spoken proficiency evaluation without fine-tuning.
Method: We propose a zero-shot paradigm that directly feeds CEFR-aligned natural language descriptors as instructions to the open-weight LLM Qwen-2.5-72B, enabling end-to-end assessment of spoken performance on the S&I corpus. The approach relies solely on textual input, requiring no task-specific training, feature engineering, or architectural modification.
Contribution/Results: While our method does not match state-of-the-art speech LLMs fine-tuned for the task, it achieves higher accuracy than a BERT-based model trained specifically for this purpose. It further demonstrates superior generalization across diverse tasks, low-resource settings, and instruction–task misalignment scenarios, while offering enhanced interpretability. These results support LLMs as trustworthy, plug-and-play assessment agents capable of operationalizing human-defined proficiency criteria without parameter adaptation.
📝 Abstract
Natural language-based assessment (NLA) is an approach to second-language assessment that uses instructions, expressed in the form of can-do descriptors originally intended for human examiners, and asks whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach, relying solely on textual information, achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.
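The zero-shot setup described above can be sketched as plain prompt construction: can-do descriptors are inserted verbatim into the instruction, and a candidate's transcribed response is graded in a single text-only call. The descriptor wording, prompt phrasing, and function names below are illustrative assumptions, not the paper's actual instructions or code.

```python
# Hypothetical sketch of zero-shot natural language-based assessment (NLA):
# CEFR-style can-do descriptors are embedded directly in the instruction text,
# and the (transcribed) spoken response is appended for the LLM to grade.
# Descriptor texts here are illustrative placeholders.
CEFR_DESCRIPTORS = {
    "B1": "Can keep going comprehensibly, even though pausing for planning "
          "and repair is very evident.",
    "B2": "Can produce stretches of language with a fairly even tempo, with "
          "few noticeably long pauses.",
    "C1": "Can express themselves fluently and spontaneously, almost "
          "effortlessly.",
}

def build_prompt(transcript: str) -> str:
    """Assemble a zero-shot grading prompt from can-do descriptors."""
    lines = [
        "You are a second-language speaking examiner.",
        "Using the can-do descriptors below, assign the most appropriate "
        "level to the candidate's response.",
        "",
    ]
    for level, descriptor in sorted(CEFR_DESCRIPTORS.items()):
        lines.append(f"{level}: {descriptor}")
    lines += [
        "",
        "Candidate response (transcribed):",
        transcript,
        "",
        "Answer with a single level.",
    ]
    return "\n".join(lines)

# The resulting string would be sent as-is to an instruction-tuned LLM
# (e.g. an open-weight model such as Qwen 2.5 72B); no fine-tuning,
# feature extraction, or architectural change is involved.
prompt = build_prompt("Yesterday I go to the market and buying some fruit...")
print(prompt)
```

Because the whole method lives in the prompt, swapping in descriptors for a different construct (e.g. grammatical range instead of fluency) or a different language requires only editing the text, which is the generalisability and interpretability argument made in the abstract.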