🤖 AI Summary
This study addresses automated scoring for large-scale, multi-item second-language (L2) speaking assessments in computer-assisted language learning. Methodologically, it proposes a novel end-to-end Automatic Speaking Assessment (ASA) paradigm that employs a single Whisper-small speech encoder to process all spoken responses uniformly, eliminating conventional ASR-based transcription and item-specific models. A lightweight feature aggregation module fuses cross-item information within an end-to-end regression architecture for direct holistic score prediction. To mitigate class imbalance, a targeted data sampling strategy is introduced: the model achieves an RMSE of 0.383 while training on only 44.8% of speakers (a roughly 55% reduction versus the baseline training set), substantially outperforming the text-based baseline (RMSE = 0.44). With at most 168 million parameters, the system demonstrates high data efficiency, strong generalization across item types, and low inference overhead.
📝 Abstract
We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is processing all four spoken responses with a single Whisper-small encoder, combining their information via a lightweight aggregator, and predicting the final score directly. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems.
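The pipeline described above (shared encoder, lightweight aggregator, regression head) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Whisper-small encoder is replaced by a stub that returns a fixed-size utterance embedding, the aggregator is assumed to be simple mean-pooling over the four part embeddings, and the regression head is a plain linear layer. All names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 768  # hidden size of Whisper-small's encoder

def encode(audio: np.ndarray) -> np.ndarray:
    """Stub for the shared speech encoder: maps raw audio to one
    fixed-size embedding. In the real system this would be the single
    Whisper-small encoder applied to every part's response."""
    # Hypothetical stand-in: a random linear projection of the waveform.
    proj = rng.standard_normal((audio.shape[0], EMB_DIM)) / np.sqrt(audio.shape[0])
    return audio @ proj

def aggregate(part_embs: list) -> np.ndarray:
    """Lightweight cross-part aggregator (assumed here: mean-pooling
    over the per-part embeddings)."""
    return np.mean(part_embs, axis=0)

def predict_score(w: np.ndarray, b: float, parts: list) -> float:
    """Linear regression head mapping the fused embedding to a single
    holistic score for the whole multi-part test."""
    fused = aggregate([encode(audio) for audio in parts])
    return float(fused @ w + b)

# Toy usage: four spoken parts (1 s of 16 kHz audio each) and random
# regression weights; a trained system would learn w and b from data.
parts = [rng.standard_normal(16000) for _ in range(4)]
w, b = rng.standard_normal(EMB_DIM) * 0.01, 3.0
score = predict_score(w, b, parts)
```

Because one encoder serves all four parts, inference cost grows with audio length rather than with the number of per-part models, which is the efficiency argument made above.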
Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy that allows the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
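One simple way to realize a class-balancing sampling strategy like the one mentioned above is to cap the number of speakers drawn from each holistic-score band, so over-represented bands cannot dominate training. The sketch below illustrates this idea only; the paper's actual sampling procedure is not specified here, and `balanced_sample` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(speakers, cap_per_score, seed=0):
    """Illustrative targeted sampling: keep at most `cap_per_score`
    speakers from each holistic score band. `speakers` is a list of
    (speaker_id, score) pairs."""
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for spk, score in speakers:
        by_score[score].append(spk)
    sampled = []
    for score, group in by_score.items():
        rng.shuffle(group)  # random draw within the band
        sampled.extend((spk, score) for spk in group[:cap_per_score])
    return sampled

# Toy corpus where score 3.0 is heavily over-represented.
corpus = [(f"s{i}", 3.0) for i in range(100)] + \
         [(f"t{i}", 4.5) for i in range(10)]
subset = balanced_sample(corpus, cap_per_score=10)
# The minority band (4.5) is kept in full; the majority band is capped,
# shrinking the training set while rebalancing the score distribution.
```

Capping majority bands both shrinks the number of training speakers (the data-efficiency result above) and flattens the score distribution, which is the intuition behind the improved performance on imbalanced classes.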