Tagarela - A Portuguese speech dataset from podcasts

2026-03-16
AI Summary
This work addresses the scarcity of large-scale, high-quality public speech datasets for Portuguese, which has hindered progress in automatic speech recognition (ASR) and text-to-speech (TTS). The authors present TAGARELA, a Portuguese podcast speech dataset of over 8,972 hours, a scale that approaches that of English GigaSpeech (10,000 hours). The corpus passes through an audio pre-processing pipeline and is then transcribed with a mixed strategy built on ASR models previously trained on high-fidelity transcriptions from proprietary APIs. To validate the resource, the authors train ASR and TTS models exclusively on the dataset and evaluate their performance. The dataset is publicly released.

Abstract
Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals that of the English GigaSpeech corpus (10,000 hours), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is publicly available at https://freds0.github.io/TAGARELA/.
Problem

Research questions and friction points this paper is trying to address.

Portuguese speech dataset
under-resourced language
automatic speech recognition
text-to-speech
speech processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Portuguese speech dataset
large-scale ASR
hybrid transcription pipeline
podcast-based corpus
speech technology for low-resource languages
Frederico Santos de Oliveira
Federal University of Mato Grosso (UFMT)
Lucas Rafael Stefanel Gris
Federal University of Goias (UFG)
Alef Iury Siqueira Ferreira
Universidade Federal de Goiás
Machine Learning, Deep Learning, Speech Recognition, Bioacoustics, Natural Language Processing
Augusto Seben da Rosa
Paulista State University (UNESP)
Alexandre Costa Ferro Filho
Federal University of Goias (UFG)
Edresson Casanova
Senior Research Scientist at NVIDIA
Text-to-Speech, Speech Synthesis, Speech processing, Duplex S2S
Christopher Dane Shulby
Elsa Speak
Rafael Teixeira Sousa
Federal University of Mato Grosso (UFMT)
Diogo Fernandes Costa Silva
Federal University of Goias (UFG)
Anderson da Silva Soares
Deep Learning Brazil at Federal University of Goias.
Deep Learning, Pattern Recognition and Artificial Intelligence
Arlindo Rodrigues Galvão Filho
Federal University of Goias (UFG)