🤖 AI Summary
Current sign language NLP research is severely constrained by the scarcity of large-scale, multilingual parallel corpora, particularly for Latin American sign languages and German Sign Language (DGS). To address this, the authors present a parallel video corpus covering 12 sign languages, including the first consistent corpora for eight Latin American sign languages and a tenfold expansion of the available DGS data. The corpus comprises 4,381 videos (more than 1,300 hours) with 1.3M aligned subtitles (14M tokens), sourced from publicly available news, governmental, and educational channels. The construction pipeline spans data collection, notifying content creators and obtaining usage approvals, scraping, and cropping. The result is the largest publicly available multilingual sign language parallel corpus to date, supporting sign language machine translation, cross-lingual representation learning, and low-resource sign language NLP.
📝 Abstract
We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours of video in 4,381 files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, while the German Sign Language corpus is ten times the size of previously available corpora. The collection was created by gathering and processing videos in multiple sign languages from various online sources, mainly broadcast material from news shows, governmental bodies, and educational channels. The preparation involved several stages: data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.