Tagarela - A Portuguese speech dataset from podcasts

📅 2026-03-16

🤖 AI Summary

This work addresses the scarcity of public, large-scale, high-quality speech datasets for Portuguese, which has held back automatic speech recognition (ASR) and text-to-speech (TTS) for the language. The authors present TAGARELA, a Portuguese podcast speech dataset of over 8,972 hours, a scale that approaches that of English GigaSpeech (10,000 hours). The corpus passes through an audio pre-processing pipeline and is then transcribed with a mixed strategy that relies on ASR models previously trained on high-fidelity transcriptions generated by proprietary APIs. To validate the resource, the authors train ASR and TTS models exclusively on the dataset and evaluate their performance. The dataset has been publicly released.

📝 Abstract

Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals that of the English GigaSpeech corpus (10,000 hours), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is publicly available at https://freds0.github.io/TAGARELA/ to foster the development of robust speech technologies.
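The abstract does not spell out how the mixed transcription strategy is implemented, so the following is only a minimal sketch of one possible reading: a small seed subset is transcribed by a high-fidelity API, an ASR model is trained on those transcripts, and that model labels the remaining audio. Everything here is an assumption made for illustration and not the authors' code: the `Segment` type, the `hybrid_transcribe` and `total_hours` helpers, the `seed_fraction` parameter, and the `api_transcribe` and `train_asr` callables are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One audio segment of the corpus (hypothetical structure)."""
    segment_id: str
    duration_s: float
    transcript: str = ""
    source: str = ""  # "api" or "asr", records who produced the transcript


def hybrid_transcribe(
    segments: List[Segment],
    api_transcribe: Callable[[Segment], str],
    train_asr: Callable[[List[Segment]], Callable[[Segment], str]],
    seed_fraction: float = 0.1,  # assumed value, not taken from the paper
) -> List[Segment]:
    """Label a seed subset with the API, then label the rest with an ASR
    model trained on that seed subset."""
    if not 0.0 < seed_fraction <= 1.0:
        raise ValueError("seed_fraction must be in (0, 1]")

    # Split the corpus into a seed part and the remainder.
    n_seed = max(1, int(len(segments) * seed_fraction))
    seed, rest = segments[:n_seed], segments[n_seed:]

    # Step 1: high-fidelity transcripts for the seed subset.
    for seg in seed:
        seg.transcript = api_transcribe(seg)
        seg.source = "api"

    # Step 2: train an ASR model on the API-labelled seed subset.
    asr_transcribe = train_asr(seed)

    # Step 3: transcribe the remaining audio with the trained model.
    for seg in rest:
        seg.transcript = asr_transcribe(seg)
        seg.source = "asr"

    return segments


def total_hours(segments: List[Segment]) -> float:
    """Total audio duration in hours."""
    return sum(seg.duration_s for seg in segments) / 3600.0
```

In a real setting the two callables would wrap an API client and a training routine; in this sketch they are plain functions so the data flow can be exercised with stubs.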

Problem

Research questions and friction points this paper is trying to address.

Portuguese speech dataset

under-resourced language

automatic speech recognition

text-to-speech

speech processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Portuguese speech dataset

large-scale ASR

hybrid transcription pipeline