USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Continuous Sign Language Recognition (CSLR) faces two key challenges: insufficient modeling of fine-grained hand and facial features, and difficulty capturing long-range temporal dependencies. To address these, the authors propose an end-to-end, single-stream RGB video understanding framework built upon a Swin Transformer backbone. They introduce TAPE (Temporal Adapter with Positional Embeddings), a lightweight temporal adapter enabling deep spatial–temporal coupling without requiring multi-stream inputs or auxiliary modalities (e.g., skeletons or depth maps). By integrating joint spatiotemporal encoding with refined positional modeling, the approach significantly enhances representation learning for dynamic gestures and facial cues. On the PHOENIX14, PHOENIX14T, and CSL-Daily benchmarks, this purely RGB-based method achieves state-of-the-art performance among RGB and multimodal approaches while remaining competitive with complex multi-stream systems, demonstrating the effectiveness and generalizability of lightweight, unified spatiotemporal modeling.

📝 Abstract
Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns by combining a Swin Transformer backbone with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on the benchmark datasets PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while remaining competitive with multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM
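The abstract describes TAPE as a lightweight temporal adapter with positional embeddings attached to a Swin backbone. The paper's exact design is in the linked repository; as a rough illustration only, a minimal NumPy sketch of a generic bottleneck-style temporal adapter (hypothetical function and parameter names, simple neighbour-averaging standing in for whatever temporal mixing the paper actually uses) might look like this:

```python
import numpy as np

def tape_block(x, pos, w_down, w_up):
    """Hypothetical TAPE-style adapter: adds temporal context to
    per-frame backbone features via a residual bottleneck.
    x:      (T, D) per-frame spatial features from the backbone
    pos:    (T, d) temporal positional embeddings (learnable in practice)
    w_down: (D, d) down-projection weights
    w_up:   (d, D) up-projection weights
    """
    h = x @ w_down + pos              # project to bottleneck, add temporal position
    # toy temporal mixing: average each frame with its immediate neighbours
    mixed = np.empty_like(h)
    for t in range(len(h)):
        lo, hi = max(0, t - 1), min(len(h), t + 2)
        mixed[t] = h[lo:hi].mean(axis=0)
    h = np.maximum(mixed, 0.0)        # ReLU non-linearity
    return x + h @ w_up               # residual: backbone features pass through

# toy usage: 4 frames, 8-dim features, 2-dim bottleneck
rng = np.random.default_rng(0)
T, D, d = 4, 8, 2
x = rng.standard_normal((T, D))
out = tape_block(x, np.zeros((T, d)), np.zeros((D, d)), np.zeros((d, D)))
print(np.allclose(out, x))  # zero-initialised adapter acts as an identity
```

The residual form is what makes such adapters "lightweight": with zero-initialised projections the block is an identity, so the pretrained spatial backbone is undisturbed at the start of training and only gradually acquires temporal behaviour.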
Problem

Research questions and friction points this paper is trying to address.

Capturing fine-grained hand and facial cues
Modeling long-range temporal dependencies in sign language
Recognizing signs from RGB video alone, without extra modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Swin Transformer backbone with TAPE adapters
Unified spatio-temporal modeling for fine-grained cues
Robust RGB-only recognition without multi-stream inputs