One Whisper to Grade Them All

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses automated scoring of large-scale, multi-item second-language (L2) speaking assessments in computer-assisted language learning. Methodologically, it proposes a novel end-to-end Automatic Speaking Assessment (ASA) paradigm that employs a single Whisper-small speech encoder to process all spoken responses uniformly, eliminating conventional ASR-based transcription and item-specific models. A lightweight feature aggregation module fuses cross-item information within an end-to-end regression architecture that predicts the holistic score directly. To mitigate class imbalance, a targeted data sampling strategy is introduced. The model achieves an RMSE of 0.383 using only 44.8% of the speakers in the corpus (a 55.2% reduction in training speakers), outperforming the text-based baseline (RMSE = 0.44). With at most 168 million parameters, it demonstrates high data efficiency, strong generalization across item types, and low inference overhead.

📝 Abstract
We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
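The pipeline the abstract describes (one shared encoder over all four responses, a lightweight aggregator, a regression head) can be sketched at the shape level. This is an illustrative sketch only: the random projection stands in for the pretrained Whisper-small encoder, and the pooling, uniform part weights, and linear head are placeholder assumptions, not the paper's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 768   # Whisper-small hidden size
NUM_PARTS = 4     # four spoken responses per test
N_MELS = 80       # mel-spectrogram features per frame

# Shared "encoder": one fixed projection applied to every response,
# mimicking a single encoder reused across all parts (assumption).
PROJ = rng.standard_normal((N_MELS, EMBED_DIM)) * 0.01

def encode(audio_frames: np.ndarray) -> np.ndarray:
    """Map a (num_frames, N_MELS) spectrogram to (num_frames, EMBED_DIM)
    hidden states; stand-in for the Whisper-small encoder."""
    return audio_frames @ PROJ

def pool(frame_states: np.ndarray) -> np.ndarray:
    """Collapse variable-length frame states to one vector per response."""
    return frame_states.mean(axis=0)

# Lightweight aggregator: weights over the four part embeddings
# (uniform here; could be learned or attention-based), then a
# linear regression head with placeholder parameters.
part_weights = np.full(NUM_PARTS, 1.0 / NUM_PARTS)
head_w = rng.standard_normal(EMBED_DIM) * 0.01
head_b = 3.0  # bias near the middle of the score scale (assumption)

def predict_holistic_score(responses: list[np.ndarray]) -> float:
    part_embs = np.stack([pool(encode(r)) for r in responses])  # (4, 768)
    fused = part_weights @ part_embs                            # (768,)
    return float(fused @ head_w + head_b)

# Four responses of different lengths, N_MELS features per frame
responses = [rng.standard_normal((n, N_MELS)) for n in (120, 200, 90, 150)]
score = predict_holistic_score(responses)
```

Because the encoder and head are shared across all four parts, no per-part model is needed and inference is a single forward pass per response plus one cheap fusion step, which is the efficiency argument the abstract makes.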
Problem

Research questions and friction points this paper is trying to address.

Efficient end-to-end automatic speaking assessment for multi-part tests
Single model processing all responses without transcription
Improved performance with reduced data and parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single Whisper-small encoder for multi-part processing
Lightweight aggregator combines all response information
Data sampling strategy enhances efficiency and performance
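The abstract reports that sampling only 44.8% of speakers improves performance on imbalanced classes. The paper's exact strategy is not detailed here; one common approach it is consistent with is capping the number of speakers kept per score band, sketched below with a hypothetical `sample_speakers` helper (band width and cap are assumptions).

```python
import random
from collections import defaultdict

def sample_speakers(speaker_scores: dict, cap: int, seed: int = 0) -> list:
    """Keep at most `cap` speakers per 0.5-wide score band, so
    over-represented mid-range scores do not dominate training."""
    rng = random.Random(seed)
    by_band = defaultdict(list)
    for speaker, score in speaker_scores.items():
        band = round(score * 2) / 2  # bucket scores into 0.5-wide bands
        by_band[band].append(speaker)
    kept = []
    for band in sorted(by_band):
        speakers = by_band[band]
        rng.shuffle(speakers)        # random choice within each band
        kept.extend(speakers[:cap])
    return kept

# Ten speakers scored 3.0 and one scored 5.0: capping at 4 keeps
# all rare high scorers while trimming the crowded middle band.
scores = {f"s{i}": 3.0 for i in range(10)}
scores["a"] = 5.0
kept = sample_speakers(scores, cap=4)
```

Under this kind of cap, rare score classes are always retained in full while dense classes are subsampled, which is one plausible way the reported gain on imbalanced classes could arise from training on fewer speakers.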