🤖 AI Summary
Low-resource Cantonese ASR faces challenges including a complex tonal inventory (six lexical tones plus tone sandhi), scarce annotated data, and substantial accent variation across speakers, leading to high error rates in mainstream models such as Whisper. To address this, we propose an acoustic–language-model collaborative error-correction framework: (1) prosody-aware acoustic features are extracted via forced alignment; (2) Whisper is fine-tuned with LoRA to enhance tonal discrimination; and (3) an instruction-tuned Qwen-Audio large language model integrates explicit acoustic cues for context-aware error correction. Our key contribution is the first tone-aware ASR–LM collaboration mechanism, which explicitly models prosodic information to improve correction robustness. Experiments on spontaneous Cantonese speech show substantial reductions in character error rate over Whisper-Large-V3, validating the efficacy and scalability of combining acoustic priors with large-model reasoning.
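Step (2) adapts Whisper with LoRA, which freezes the pretrained weights and trains only a low-rank additive update. A minimal NumPy sketch of the mechanism (dimensions, rank, and scaling are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained projection (stand-in for one Whisper attention weight).
d_model, r, alpha = 64, 8, 16
W = rng.standard_normal((d_model, d_model))

# LoRA adapters: A starts small and random, B starts at zero,
# so the adapted layer initially matches the base model exactly.
A = rng.standard_normal((r, d_model)) * 0.01
B = np.zeros((d_model, r))

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T; only A and B are trainable."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_model))
y = lora_forward(x, W, A, B, alpha, r)

# With B = 0, the LoRA branch contributes nothing yet.
assert np.allclose(y, x @ W.T)
```

Because only the rank-`r` factors are updated, the trainable parameter count is 2 * d_model * r per adapted matrix rather than d_model², which is what makes fine-tuning feasible on limited Cantonese data.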
📝 Abstract
Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high error rates on such input. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER reductions over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
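The "tonal and prosodic acoustic cues" at the heart of the framework are, at their simplest, pitch (F0) trajectories over each syllable, since Cantonese tones are distinguished largely by F0 height and contour. As a hedged illustration of the kind of feature involved (not the paper's forced-alignment pipeline), a minimal autocorrelation-based F0 estimator applied to one synthetic voiced frame:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=70.0, fmax=400.0):
    """Estimate fundamental frequency of a frame via autocorrelation peak picking."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search lags within the plausible F0 range
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(sr // 10) / sr            # one 100 ms analysis frame
tone = np.sin(2 * np.pi * 220.0 * t)    # synthetic voiced segment at 220 Hz
f0 = estimate_f0(tone, sr)
assert abs(f0 - 220.0) < 5.0
```

Tracking such F0 values frame by frame within each aligned syllable yields the contour features a tone-aware corrector can condition on; production systems would use a robust tracker (e.g., pYIN) rather than this sketch.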