Uni-Sign: Toward Unified Sign Language Understanding at Scale

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sign language understanding (SLU) methods suffer from a significant gap between pretraining and downstream tasks, resulting in limited generalization and robustness. To address this, we propose a unified SLU framework: (1) We reformulate all SLU tasks—including translation, recognition, and comprehension—as a generative sign language translation task, establishing a unified task paradigm; (2) We introduce a Prior-Guided Fusion (PGF) module and a score-aware dynamic sampling strategy to robustly integrate pose and RGB multimodal features; (3) We construct CSL-News, a large-scale Chinese sign language video–text dataset comprising 1,985 hours of annotated data. Our framework achieves state-of-the-art performance across multiple benchmarks, with substantial improvements in translation accuracy, sign recognition, and semantic understanding. Both the source code and the CSL-News dataset are publicly released.

📝 Abstract
Sign language pre-training has gained increasing attention for its ability to enhance performance across various sign language understanding (SLU) tasks. However, existing methods often suffer from a gap between pre-training and fine-tuning, leading to suboptimal results. To address this, we propose Uni-Sign, a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks through a large-scale generative pre-training strategy and a novel fine-tuning paradigm. First, we introduce CSL-News, a large-scale Chinese Sign Language (CSL) dataset containing 1,985 hours of video paired with textual annotations, which enables effective large-scale pre-training. Second, Uni-Sign unifies SLU tasks by treating downstream tasks as a single sign language translation (SLT) task during fine-tuning, ensuring seamless knowledge transfer between pre-training and fine-tuning. Furthermore, we incorporate a prior-guided fusion (PGF) module and a score-aware sampling strategy to efficiently fuse pose and RGB information, addressing keypoint inaccuracies and improving computational efficiency. Extensive experiments across multiple SLU benchmarks demonstrate that Uni-Sign achieves state-of-the-art performance across multiple downstream SLU tasks. Dataset and code are available at github.com/ZechengLi19/Uni-Sign.
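The core idea of the unified paradigm — casting every downstream SLU task as a single generative translation task — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the task prompts, helper names, and toy decoder are all hypothetical.

```python
# Hypothetical sketch of a "everything is sign language translation" interface.
# All names (prompts, helpers) are illustrative, not taken from Uni-Sign.

TASK_PROMPTS = {
    "translation": "Translate the sign language video into text:",
    "recognition": "Recognize the glosses in the sign language video:",
    "retrieval":   "Describe the sign language video:",
}

def unified_slu(task, video_features, generate):
    """Route any SLU task through one generative decoder.

    task           -- key into TASK_PROMPTS
    video_features -- fused pose/RGB features (e.g., from a PGF-style module)
    generate       -- a text decoder: (prompt, features) -> str
    """
    prompt = TASK_PROMPTS[task]
    return generate(prompt, video_features)

# Toy decoder standing in for the real sequence-to-sequence model.
def toy_generate(prompt, features):
    return f"{prompt} <output for {len(features)} frames>"

print(unified_slu("recognition", [0.1, 0.2, 0.3], toy_generate))
```

Because every task shares the same prompt-conditioned generation interface, fine-tuning on any downstream task reuses the pre-trained translation capability directly, which is the gap-closing property the abstract describes.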
Problem

Research questions and friction points this paper is trying to address.

Sign Language Understanding
Effectiveness Improvement
Machine Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uni-Sign
Pre-training and Fine-tuning Strategy
PGF Module and Sampling
Zecheng Li
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Wengang Zhou
Professor, EEIS Department, University of Science and Technology of China
Multimedia Retrieval, Computer Vision, Computer Game
Weichao Zhao
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Kepeng Wu
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Hezhen Hu
University of Texas at Austin
Sign Language Recognition, Sign Language Translation, Video Understanding
Houqiang Li
Professor, Department of Electronic Engineering and Information Science, University of Science and Technology of China
Multimedia Search, Image/Video Analysis, Image/Video Coding