Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current sign language translation (SLT) datasets suffer from limited scale, narrow language coverage, and heavy reliance on costly expert annotation. To address these challenges, we propose the first end-to-end, vision-language model (VLM)-driven automated data construction framework for multilingual sign language data, leveraging social media videos. Our method integrates face visibility detection, sign activity recognition, OCR-based text extraction, and VLM-based video-text alignment verification to enable efficient collection, filtering, and weakly supervised annotation, substantially reducing manual labeling effort. Applying this framework to TikTok videos, we curate TikTok-SL-8, a corpus covering eight sign languages; for additional evaluation, we also apply the pipeline to the German Sign Language portion of the existing YouTube-SL-25 dataset. We empirically validate the robustness of off-the-shelf SLT models under label noise on the filtered German and American Sign Language data. Our core contribution lies in the first systematic integration of VLMs across the entire sign language data curation pipeline, enabling low-cost, high-quality, and scalable weakly supervised dataset construction.
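
The paper does not reproduce its pipeline code; the sketch below only illustrates how the four described stages could be chained into a single filter-annotate-validate pass. Every helper function in it (detect_face_visibility, detect_signing, extract_on_screen_text, vlm_judge_alignment) is a hypothetical placeholder, not the authors' implementation:

```python
# Illustrative sketch only: the stage order mirrors the paper's description,
# but every helper below is a hypothetical placeholder, not the authors' code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CuratedClip:
    video_path: str
    weak_label: str  # on-screen text extracted via OCR


def detect_face_visibility(video_path: str) -> bool:
    raise NotImplementedError  # e.g., a face detector over sampled frames


def detect_signing(video_path: str) -> bool:
    raise NotImplementedError  # e.g., a VLM or action model flags signing


def extract_on_screen_text(video_path: str) -> Optional[str]:
    raise NotImplementedError  # e.g., OCR over sampled frames


def vlm_judge_alignment(video_path: str, text: str) -> bool:
    raise NotImplementedError  # VLM-as-judge: does the text match the video?


def curate(video_path: str) -> Optional[CuratedClip]:
    """Filter -> annotate -> validate, dropping a clip at the first failure."""
    if not detect_face_visibility(video_path):
        return None
    if not detect_signing(video_path):
        return None
    text = extract_on_screen_text(video_path)
    if not text:
        return None
    if not vlm_judge_alignment(video_path, text):
        return None
    return CuratedClip(video_path=video_path, weak_label=text)
```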

📝 Abstract
Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setups. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and, for additional evaluation, to German Sign Language videos from the already curated YouTube-SL-25 dataset. Our VLM-based pipeline includes face visibility detection, sign activity recognition, text extraction from video content, and a judgment step that validates alignment between video and text, implementing generic filtering, annotation, and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
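
As an illustration of the judgment step, here is a hypothetical video-text alignment check assuming an OpenAI-compatible multimodal chat endpoint; the paper does not specify which VLM, prompt wording, or frame-sampling scheme it uses, and the model name below is an assumption:

```python
# Hypothetical video-text judgment step using an OpenAI-compatible
# multimodal chat API; the VLM, prompt, and model name are assumptions,
# not details taken from the paper.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def frame_to_data_url(jpeg_bytes: bytes) -> str:
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()


def judge_alignment(frames: list[bytes], extracted_text: str) -> bool:
    """Ask the VLM whether sampled frames plausibly match the OCR'd caption."""
    content = [
        {
            "type": "text",
            "text": (
                "These frames come from a sign language video. Does the "
                f"on-screen caption '{extracted_text}' plausibly describe "
                "what is being signed? Answer YES or NO."
            ),
        }
    ] + [
        {"type": "image_url", "image_url": {"url": frame_to_data_url(f)}}
        for f in frames
    ]
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # any multimodal chat model would do
        messages=[{"role": "user", "content": content}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```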
Problem

Research questions and friction points this paper is trying to address.

Automating sign language data annotation using vision-language models
Reducing manual curation costs for multilingual sign language datasets
Enabling scalable sign language translation from social media
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-based automated annotation and filtering framework
Face visibility and sign activity recognition pipeline (a face-visibility sketch follows after this list)
Text-video alignment validation for social media data
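
As a concrete example of the first filtering stage, the following minimal face-visibility check runs OpenCV's bundled Haar cascade over sampled frames. The detector choice, sampling rate, and hit-rate threshold are all assumptions; the paper does not detail its face visibility module.

```python
# Minimal face-visibility filter: a clip passes if a face is detected in
# at least min_hit_rate of the sampled frames. Detector and thresholds
# are illustrative assumptions, not the paper's settings.
import cv2


def face_visible(video_path: str,
                 sample_every: int = 30,
                 min_hit_rate: float = 0.5) -> bool:
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    cap = cv2.VideoCapture(video_path)
    hits = total = frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every == 0:  # sample roughly one frame/second
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                              minNeighbors=5)
            hits += int(len(faces) > 0)
            total += 1
        frame_idx += 1
    cap.release()
    return total > 0 and hits / total >= min_hit_rate
```

Thresholding on the hit rate across sampled frames, rather than requiring a face in every frame, avoids rejecting clips where the signer's face is briefly occluded by their hands.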