🤖 AI Summary
Current sign language NLP research is severely constrained by the scarcity of large-scale, multilingual parallel corpora, particularly for Latin American sign languages and German Sign Language (DGS). To address this, the authors present a parallel video corpus covering 12 sign languages, including the first consistent corpora for eight Latin American sign languages and a tenfold expansion of the available DGS data. The corpus comprises 4,381 videos (more than 1,300 hours) with 1.3M aligned subtitles (14M tokens), sourced from publicly available news, governmental, and educational channels. The construction pipeline spans data collection, notifying content creators and obtaining usage approvals, scraping, and cropping. The result is the largest publicly available multilingual sign language parallel corpus to date, supporting sign language machine translation, cross-lingual representation learning, and low-resource sign language NLP.
📝 Abstract
We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours of video in 4,381 files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, while the German Sign Language corpus is ten times the size of previously available corpora. The collection was created by gathering and processing videos in multiple sign languages from various online sources, mainly broadcast material from news shows, governmental bodies, and educational channels. The preparation involved several stages: data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.