TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English

📅 2025-11-13

🏛️ Proceedings of The Third Arabic Natural Language Processing Conference

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Data scarcity hampers Tunisian Arabic to English speech translation: until now, no code-switched speech translation corpus for the dialect was publicly available. TEDxTN addresses this gap as the first open-source, code-switched Tunisian Arabic to English speech translation corpus. It comprises 108 TEDx talks (25 hours of audio) by speakers with various accents from over 11 regions of Tunisia. The talks were collected, segmented, transcribed, and translated following internally developed annotation guidelines, which are released together with the corpus so that it can be extended as new talks become available. The paper also reports strong baseline results for speech recognition and speech translation using multiple pre-trained and fine-tuned end-to-end models. The authors present the corpus as a resource to motivate and facilitate further research on natural language processing of the Tunisian dialect.

📝 Abstract

In this paper, we introduce TEDxTN, the first publicly available Tunisian Arabic to English speech translation dataset. This work is in line with the ongoing effort to mitigate the data scarcity obstacle for a number of Arabic dialects. We collected, segmented, transcribed, and translated 108 TEDx talks following our internally developed annotation guidelines. The collected talks represent 25 hours of code-switched speech and cover speakers with various accents from over 11 different regions of Tunisia. We make the annotation guidelines and the corpus publicly available, which will enable the extension of TEDxTN to new talks as they become available. We also report results for strong baseline speech recognition and speech translation systems using multiple pre-trained and fine-tuned end-to-end models. This corpus is the first open-source, publicly available speech translation corpus of the code-switched Tunisian dialect. We believe that it is a valuable resource that can motivate and facilitate further research on the natural language processing of the Tunisian dialect.
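To make the "three-way" structure concrete, the sketch below models one aligned segment as audio timing, a code-switched transcript, and an English translation, and then aggregates durations. The field names, the `Segment` class, and the toy values are assumptions for illustration only; they do not reflect the released TEDxTN file format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Segment:
    """One aligned unit of a three-way corpus (hypothetical schema)."""
    talk_id: str       # hypothetical identifier of the source talk
    start: float       # segment start time, in seconds
    end: float         # segment end time, in seconds
    transcript: str    # code-switched Tunisian Arabic transcription
    translation: str   # English translation of the segment

    @property
    def duration(self) -> float:
        """Length of the segment in seconds."""
        return self.end - self.start


def total_hours(segments: Iterable[Segment]) -> float:
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration for s in segments) / 3600.0


def hours_per_talk(segments: Iterable[Segment]) -> Dict[str, float]:
    """Accumulate the hours of speech contributed by each talk."""
    totals: Dict[str, float] = {}
    for s in segments:
        totals[s.talk_id] = totals.get(s.talk_id, 0.0) + s.duration / 3600.0
    return totals


# Toy placeholder data, not taken from the corpus.
segments = [
    Segment("talk_a", 0.0, 1800.0, "<transcript 1>", "<translation 1>"),
    Segment("talk_a", 1800.0, 3600.0, "<transcript 2>", "<translation 2>"),
    Segment("talk_b", 0.0, 900.0, "<transcript 3>", "<translation 3>"),
]

print(total_hours(segments))
print(hours_per_talk(segments))
```

Applied to the figures in the abstract, the same arithmetic gives an average of roughly 14 minutes per talk (25 hours across 108 talks).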

Problem

Research questions and friction points this paper is trying to address.

Developing the first Tunisian Arabic-English speech translation corpus

Addressing data scarcity for Arabic dialect processing

Providing code-switched speech resources for NLP research

Innovation

Methods, ideas, or system contributions that make the work stand out.

First Tunisian Arabic-English speech translation dataset

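Collected, segmented, transcribed, and translated 108 TEDx talks (25 hours of code-switched speech)

Used multiple pre-trained and fine-tuned end-to-end models as speech recognition and speech translation baselines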
