A Chinese Continuous Sign Language Dataset Based on Complex Environments

📅 2024-09-18

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing continuous sign language recognition (CSLR) methods are largely developed on laboratory or TV-program datasets, which offer limited environmental diversity and lead to poor generalization. To address this, we introduce CE-CSL, a large-scale Chinese CSLR dataset designed for complex real-world scenarios, comprising 5,988 daily-life video clips recorded against more than 70 different backgrounds. Methodologically, we model environmental diversity in CSLR explicitly and propose TFNet, an architecture that combines multi-scale temporal convolutions with a short-time Fourier transform (STFT) in the frequency domain to learn robust sequence representations from frame-level features. Evaluated within an RGB end-to-end framework, TFNet achieves significant accuracy gains on CE-CSL. Experiments on three public Chinese sign language benchmarks yield state-of-the-art (SOTA) or near-SOTA results, indicating strong generalization and practical applicability.

📝 Abstract

The current bottleneck in continuous sign language recognition (CSLR) research is that most publicly available datasets are limited to laboratory environments or television program recordings. They therefore have a single background and uniform lighting, which deviates significantly from the diversity and complexity of real-life scenarios. To address this challenge, we have constructed a new, large-scale dataset for Chinese continuous sign language (CSL) based on complex environments, termed the Complex Environment Chinese Sign Language dataset (CE-CSL). This dataset contains 5,988 continuous CSL video clips collected from daily-life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) model for continuous sign language recognition. This model extracts frame-level features and then uses temporal and spectral information to derive sequence features separately before fusing them, aiming to achieve efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on CE-CSL, validating its effectiveness under complex background conditions. Our proposed method also yields highly competitive results on three publicly available CSL datasets.

Problem

Research questions and friction points this paper is trying to address.

Addressing limited diversity in sign language datasets

Developing a recognition model for complex real-world environments

Improving performance under varied lighting and backgrounds

Innovation

Methods, ideas, or system contributions that make the work stand out.

Complex environment dataset with diverse backgrounds

Time-frequency network model for feature extraction

Separate temporal and spectral sequence features, derived independently before fusion
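
The "separate temporal and spectral branches, then fuse" idea can be illustrated with a small, dependency-free sketch. This is not the authors' TFNet: the function names, the moving-average kernel standing in for a learned temporal convolution, the plain discrete Fourier transform standing in for the spectral branch, and the concatenation used for fusion are all assumptions made purely for illustration.

```python
import cmath
import math


def temporal_branch(frames, kernel=3):
    """Smooth each feature over time with a moving average.

    Hypothetical stand-in for a learned temporal convolution.
    frames: list of T frame-level feature vectors, each of length D.
    """
    num_frames, dim = len(frames), len(frames[0])
    half = kernel // 2
    out = []
    for t in range(num_frames):
        window = frames[max(0, t - half):min(num_frames, t + half + 1)]
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out


def spectral_branch(frames, n_bins=4):
    """Return the magnitudes of the lowest n_bins DFT coefficients, per feature.

    Hypothetical stand-in for a spectral branch; it summarizes how each
    feature varies over the whole clip rather than frame by frame.
    """
    num_frames, dim = len(frames), len(frames[0])
    spectrum = []
    for d in range(dim):
        series = [f[d] for f in frames]
        mags = []
        for k in range(n_bins):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / num_frames)
                for t, x in enumerate(series)
            )
            mags.append(abs(coeff) / num_frames)
        spectrum.append(mags)
    return spectrum


def fuse(temporal, spectral):
    """Fuse by concatenating the flattened spectral summary onto every time step."""
    flat = [m for per_feature in spectral for m in per_feature]
    return [step + flat for step in temporal]


# Toy input: 8 frames, each with a 2-dimensional frame-level feature.
frames = [[1.0, 2.0] for _ in range(8)]
fused = fuse(temporal_branch(frames), spectral_branch(frames))
print(len(fused), len(fused[0]))  # 8 time steps, 10 features each
```

In a real model both branches would be learned layers operating on features from a visual backbone, and the fusion step would typically be learned as well. The sketch only shows the data flow: one set of frame-level features feeds two independent sequence-level views that are combined afterwards.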

🔎 Similar Papers

Event Stream based Sign Language Translation: A High-Definition Benchmark Dataset and A New Algorithm