The TUB Sign Language Corpus Collection

📅 2025-08-07
🤖 AI Summary
Sign language NLP research is constrained by the scarcity of large-scale, multilingual parallel corpora, particularly for Latin American sign languages and German Sign Language (DGS). To address this, we present a collection of parallel video corpora covering 12 sign languages, including the first consistent parallel corpora for eight Latin American sign languages and a tenfold increase in the amount of available DGS data. The collection comprises 4,381 video files (more than 1,300 hours) accompanied by 1.3 million subtitles (14 million tokens) in the dominant spoken languages of the corresponding countries, sourced mainly from broadcast material of news shows, governmental bodies, and educational channels. Preparation involved data collection, informing content creators and seeking usage approvals, scraping, and cropping. The resource is intended to support work on sign language machine translation and low-resource sign language processing.

📝 Abstract
We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3 M subtitles containing 14 M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, and its German Sign Language corpora are ten times the size of those previously available. The collection was created by gathering and processing videos of multiple sign languages from various online sources, mainly broadcast material from news shows, governmental bodies, and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.
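To put the headline figures in perspective, the sketch below derives a few per-video and per-subtitle averages from the numbers quoted in the abstract. It is only an illustration: the inputs are rounded (and the hour count is a lower bound), the function name `derived_stats` and the dictionary keys are hypothetical, and the paper's own statistics should be preferred wherever they differ.

```python
# Headline figures as quoted in the abstract. These are rounded values,
# and TOTAL_HOURS is a lower bound ("more than 1,300 hours").
TOTAL_HOURS = 1_300
VIDEO_FILES = 4_381
SUBTITLES = 1_300_000
TOKENS = 14_000_000


def derived_stats(hours, files, subtitles, tokens):
    """Return rough averages implied by the collection-level totals.

    Hypothetical helper for illustration; not part of the corpus tooling.
    """
    return {
        "minutes_per_video": hours * 60 / files,
        "subtitles_per_video": subtitles / files,
        "tokens_per_subtitle": tokens / subtitles,
        "subtitles_per_hour": subtitles / hours,
    }


stats = derived_stats(TOTAL_HOURS, VIDEO_FILES, SUBTITLES, TOKENS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these rounded inputs, a video file averages roughly 18 minutes and a subtitle roughly 11 tokens. These are collection-wide averages only; the distribution across the 12 sign languages is not stated in the abstract and may be uneven.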
Problem

Research questions and friction points this paper is trying to address.

Creating parallel corpora for 12 sign languages
Including first consistent corpora for 8 Latin American sign languages
Expanding German Sign Language corpus tenfold
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel corpora of 12 sign languages
More than 1,300 hours in 4,381 video files
Data collection from online sources
Eleftherios Avramidis
Senior Researcher at German Research Center for Artificial Intelligence (DFKI)
LLMs, Multilinguality, Machine Translation, Sign Language Processing
Vera Czehmann
German Research Center for AI (DFKI), Speech and Language Technology; Technische Universität Berlin
Fabian Deckert
Technische Universität Berlin
Lorenz Hufe
Fraunhofer HHI
Interpretability, Computer Vision, Natural Language Processing
Aljoscha Lipski
Technische Universität Berlin
Yuni Amaloa Quintero Villalobos
Technische Universität Berlin
Tae Kwon Rhee
Technische Universität Berlin
Mengqian Shi
Technische Universität Berlin
Lennart Stölting
Technische Universität Berlin
Fabrizio Nunnari
German Research Center for Artificial Intelligence (DFKI), Saarland Informatics Campus
Computer-Human Interaction, Sign Language, Avatars, Virtual Interpreters
Sebastian Möller
Professor for Quality and Usability, TU Berlin and Scientific Director, DFKI
Quality of Experience, User Experience, Speech, Dialog, Natural Language Processing