🤖 AI Summary
This study investigates whether large language models (LLMs) can directly interpret and apply human-authored, action-oriented “can-do” descriptors—standardized in second-language assessment—to perform natural language–based spoken proficiency evaluation without fine-tuning.
Method: We propose a zero-shot paradigm that directly feeds CEFR-aligned natural language descriptors as instructions to the open-weight LLM Qwen-2.5-72B, enabling end-to-end assessment of spoken performance on the S&I corpus. The approach relies solely on textual input, requiring no task-specific training, feature engineering, or architectural modification.
Contribution/Results: While our method does not match state-of-the-art speech LLMs fine-tuned for the task, it achieves higher accuracy than a BERT-based model trained specifically for this purpose. It further demonstrates superior generalization across diverse tasks, low-resource settings, and instruction–task misalignment scenarios, while offering enhanced interpretability. These results support LLMs as trustworthy, plug-and-play assessment agents capable of operationalizing human-defined proficiency criteria without parameter adaptation.
📝 Abstract
Natural language-based assessment (NLA) is an approach to second-language assessment that uses instructions, expressed in the form of can-do descriptors originally intended for human examiners, and asks whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach, relying solely on textual information, achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.
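The zero-shot setup described above can be sketched as plain prompt construction: can-do descriptors are inserted verbatim into the instruction, and a candidate's transcribed response is graded in a single text-only call. The descriptor wording, prompt phrasing, and function names below are illustrative assumptions, not the paper's actual instructions or code.

```python
# Hypothetical sketch of zero-shot natural language-based assessment (NLA):
# CEFR-style can-do descriptors are embedded directly in the instruction text,
# and the (transcribed) spoken response is appended for the LLM to grade.
# Descriptor texts here are illustrative placeholders.
CEFR_DESCRIPTORS = {
    "B1": "Can keep going comprehensibly, even though pausing for planning "
          "and repair is very evident.",
    "B2": "Can produce stretches of language with a fairly even tempo, with "
          "few noticeably long pauses.",
    "C1": "Can express themselves fluently and spontaneously, almost "
          "effortlessly.",
}

def build_prompt(transcript: str) -> str:
    """Assemble a zero-shot grading prompt from can-do descriptors."""
    lines = [
        "You are a second-language speaking examiner.",
        "Using the can-do descriptors below, assign the most appropriate "
        "level to the candidate's response.",
        "",
    ]
    for level, descriptor in sorted(CEFR_DESCRIPTORS.items()):
        lines.append(f"{level}: {descriptor}")
    lines += [
        "",
        "Candidate response (transcribed):",
        transcript,
        "",
        "Answer with a single level.",
    ]
    return "\n".join(lines)

# The resulting string would be sent as-is to an instruction-tuned LLM
# (e.g. an open-weight model such as Qwen 2.5 72B); no fine-tuning,
# feature extraction, or architectural change is involved.
prompt = build_prompt("Yesterday I go to the market and buying some fruit...")
print(prompt)
```

Because the whole method lives in the prompt, swapping in descriptors for a different construct (e.g. grammatical range instead of fluency) or a different language requires only editing the text, which is the generalisability and interpretability argument made in the abstract.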