🤖 AI Summary
This work addresses the automatic speech recognition (ASR) performance gap between European and Brazilian Portuguese, which stems primarily from the scarcity of large-scale annotated speech data for the former. To bridge this gap, we introduce the FalAR corpus, comprising nearly 5,800 hours of parliamentary session recordings spanning two decades, of which 4,850 hours are annotated with speaker identity (covering 1,180 speakers) and rich metadata. The data are processed with an automated transcription-reference alignment pipeline based on the CAMÕES ASR model, followed by human verification. Experimental results show that pretraining on FalAR yields up to a 14% relative reduction in word error rate, improving ASR performance for European Portuguese and helping to fill the shortage of high-quality speech resources for this variety.
📝 Abstract
State-of-the-art performance in Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. With considerably fewer speakers (around 11 million) than Brazilian Portuguese (BP, around 200 million speakers), European Portuguese (EP) is overshadowed in currently available large-scale speech data resources, which results in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. Of these, 4,850 hours have speaker identity annotations, covering a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art CAMÕES ASR model for EP to perform transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate how the trade-off between data quantity and alignment accuracy affects ASR performance; our experiments demonstrate that incorporating FalAR as pre-training data yields up to a 14% relative word error rate (WER) improvement over baseline models.
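To make the reported metric concrete, the following is a minimal sketch of how WER and a relative WER improvement are commonly computed. It is not the evaluation code used for FalAR: the function names (`edit_distance`, `wer`, `relative_wer_reduction`), the example sentences, and the numeric values are illustrative assumptions, and text normalization (casing, punctuation) is omitted.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative improvement of new_wer over baseline_wer, as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical sentences: one deleted word out of five reference words.
example_wer = wer("o deputado tem a palavra", "o deputado tem palavra")

# Hypothetical values, not results from the paper: a drop from 10.0% to 8.6%
# absolute WER corresponds to a 14% relative improvement.
example_gain = relative_wer_reduction(0.100, 0.086)
```

Note that a relative improvement is measured against the baseline WER, so a 14% relative gain is a much smaller absolute change when the baseline WER is already low.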