🤖 AI Summary
Low-resource Cantonese ASR faces challenges including a complex tonal inventory (six lexical tones plus tone sandhi), scarce annotated data, and substantial accent variation across speakers, leading to high error rates in mainstream models such as Whisper. To address this, we propose an acoustic–language-model collaborative error-correction framework: (1) prosody-aware acoustic features are extracted via forced alignment; (2) Whisper is fine-tuned with LoRA to enhance tonal discrimination; and (3) an instruction-tuned Qwen-Audio large language model integrates explicit acoustic cues for context-aware error correction. Our key contribution is the first tone-aware ASR–LM collaboration mechanism, which explicitly models prosodic information to improve correction robustness. Experiments on spontaneous Cantonese speech show substantial reductions in character error rate over Whisper-Large-V3, validating the efficacy and scalability of combining acoustic priors with large-model reasoning.
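Step (2) adapts Whisper with LoRA, which freezes the pretrained weights and trains only a low-rank additive update. A minimal NumPy sketch of the mechanism (dimensions, rank, and scaling are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained projection (stand-in for one Whisper attention weight).
d_model, r, alpha = 64, 8, 16
W = rng.standard_normal((d_model, d_model))

# LoRA adapters: A starts small and random, B starts at zero,
# so the adapted layer initially matches the base model exactly.
A = rng.standard_normal((r, d_model)) * 0.01
B = np.zeros((d_model, r))

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T; only A and B are trainable."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_model))
y = lora_forward(x, W, A, B, alpha, r)

# With B = 0, the LoRA branch contributes nothing yet.
assert np.allclose(y, x @ W.T)
```

Because only the rank-`r` factors are updated, the trainable parameter count is 2 * d_model * r per adapted matrix rather than d_model², which is what makes fine-tuning feasible on limited Cantonese data.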
📝 Abstract
Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high error rates on such input. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER reductions over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
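The "tonal and prosodic acoustic cues" at the heart of the framework are, at their simplest, pitch (F0) trajectories over each syllable, since Cantonese tones are distinguished largely by F0 height and contour. As a hedged illustration of the kind of feature involved (not the paper's forced-alignment pipeline), a minimal autocorrelation-based F0 estimator applied to one synthetic voiced frame:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=70.0, fmax=400.0):
    """Estimate fundamental frequency of a frame via autocorrelation peak picking."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search lags within the plausible F0 range
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(sr // 10) / sr            # one 100 ms analysis frame
tone = np.sin(2 * np.pi * 220.0 * t)    # synthetic voiced segment at 220 Hz
f0 = estimate_f0(tone, sr)
assert abs(f0 - 220.0) < 5.0
```

Tracking such F0 values frame by frame within each aligned syllable yields the contour features a tone-aware corrector can condition on; production systems would use a robust tracker (e.g., pYIN) rather than this sketch.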