TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

2026-03-22
AI Summary
This study addresses intent recognition for Taiwanese Hokkien, a low-resource, primarily spoken language that lacks a standardized orthography and real-world speech data, by introducing TaigiSpeech, a dataset of 3,000 utterances from 21 older adult speakers. To address the scarcity of labeled data, the authors explore two data mining strategies at different levels of supervision: keyword matching combined with LLM pseudo-labeling via an intermediate language, and an audio-visual framework that uses multimodal cues with minimal textual supervision. This design is intended to make dataset construction scalable for low-resource and unwritten spoken languages. TaigiSpeech is to be released under the CC BY 4.0 license, and the paper reports preliminary results on it.

Abstract
Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (also known as Taiwanese Hokkien or Southern Min), a low-resource and primarily spoken language. The dataset was collected from 21 older adult speakers and comprises 3k utterances in total. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: (1) keyword-match data mining with LLM pseudo-labeling via an intermediate language, and (2) an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found at https://kwchang.org/taigispeech.
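
The first mining strategy named in the abstract (keyword matching over intermediate-language text) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Candidate` type, the `INTENT_KEYWORDS` table, the intent names, and the English example transcripts are all hypothetical, and the LLM pseudo-labeling step is left out entirely. The sketch only shows the general idea of keeping clips whose accompanying text matches the keywords of exactly one intent.

```python
from dataclasses import dataclass

# Hypothetical intent -> keyword table in the intermediate language.
# English is used here only for readability; the real keyword lists and
# the intermediate language are not specified in the abstract.
INTENT_KEYWORDS = {
    "call_for_help": ["help", "emergency"],
    "lights_on": ["turn on the light", "lights on"],
}


@dataclass
class Candidate:
    clip_id: str      # identifier of an audio clip (hypothetical)
    transcript: str   # intermediate-language text attached to the clip


def keyword_match(candidates, intent_keywords):
    """Return (clip_id, intent) pairs for clips matching exactly one intent.

    Clips that match no intent, or more than one, are dropped as ambiguous.
    """
    mined = []
    for cand in candidates:
        text = cand.transcript.lower()
        hits = [
            intent
            for intent, keywords in intent_keywords.items()
            if any(kw in text for kw in keywords)
        ]
        if len(hits) == 1:
            mined.append((cand.clip_id, hits[0]))
    return mined


candidates = [
    Candidate("a", "Please help me"),
    Candidate("b", "Turn on the light please"),
    Candidate("c", "Nice weather today"),
    Candidate("d", "Help, turn on the light"),
]
mined = keyword_match(candidates, INTENT_KEYWORDS)
```

In this sketch, clip `c` is dropped because it matches nothing and clip `d` because it matches two intents. A fuller version would pass the surviving candidates to a separate verification or pseudo-labeling step before they enter the dataset.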

Problem

Research questions and friction points this paper is trying to address.

low-resource languages
speech intent dataset
unwritten languages
Taiwanese Taigi
real-world speech

Innovation

Methods, ideas, or system contributions that make the work stand out.

low-resource speech
data mining
multimodal learning
pseudo labeling
spoken language dataset