🤖 AI Summary
This study addresses the end-to-end translation of Swiss German Sign Language (video) to German (text), tackling poor generalization stemming from insufficient spatiotemporal feature modeling. We propose a unified video-to-text sequence-to-sequence framework that jointly extracts frame-level spatiotemporal dynamics using CNNs and RNNs, and introduces a customized input embedding for joint optimization. Unlike cascaded pipeline approaches, our architecture performs end-to-end learning of sign visual representations and linguistic generation, explicitly modeling temporal action dependencies. Experiments yield 5.0±1.0 BLEU on the development set but only 0.11±0.06 BLEU on the test set—revealing severe overfitting and underscoring the critical challenge of improving cross-sample spatiotemporal generalization. Our primary contributions are: (i) establishing the first end-to-end translation baseline for Swiss German Sign Language, and (ii) systematically validating the impact of joint spatiotemporal modeling on translation performance.
📝 Abstract
This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of the word embeddings used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a truly end-to-end architecture that is expected to generalize better to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but performance on the test set dropped to $0.11\pm0.06$ BLEU points.
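The architecture described above can be sketched as a single trainable module: a per-frame CNN extracts spatial features, an RNN encoder models temporal dynamics over the frame sequence, and an RNN decoder generates German tokens. This is not the authors' code; it is a minimal PyTorch illustration, and all layer sizes and names are illustrative assumptions.

```python
# Hedged sketch of an end-to-end video-to-text seq2seq model:
# per-frame CNN -> RNN encoder -> RNN decoder over target tokens.
# Hyperparameters are illustrative, not the submission's actual values.
import torch
import torch.nn as nn

class SignTranslator(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=256, hidden=512):
        super().__init__()
        # Spatial features per frame (tiny CNN stand-in).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Temporal modeling over the frame sequence.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, tokens):
        # frames: (B, T, 3, H, W) video clips; tokens: (B, L) target prefix.
        b, t = frames.shape[:2]
        # Run the CNN on every frame, then restore the time dimension.
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.encoder(feats)             # encoder summary state
        dec, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec)                   # (B, L, vocab) logits

model = SignTranslator()
frames = torch.randn(2, 8, 3, 64, 64)    # 2 clips, 8 frames each
tokens = torch.zeros(2, 5, dtype=torch.long)
logits = model(frames, tokens)
print(logits.shape)  # torch.Size([2, 5, 1000])
```

Because the CNN, encoder, and decoder sit in one module, the translation loss backpropagates through the visual feature extractor, which is what distinguishes this end-to-end setup from cascaded pipelines with frozen, pre-extracted features.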