Spatio-temporal Sign Language Representation and Translation

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the end-to-end translation of Swiss German Sign Language (video) to German (text), tackling poor generalization stemming from insufficient spatiotemporal feature modeling. We propose a unified video-to-text sequence-to-sequence framework that jointly extracts frame-level spatiotemporal dynamics using CNNs and RNNs, and introduces a customized input embedding for joint optimization. Unlike cascaded pipeline approaches, our architecture performs end-to-end learning of sign visual representations and linguistic generation, explicitly modeling temporal action dependencies. Experiments yield 5.0±1.0 BLEU on the development set but only 0.11±0.06 BLEU on the test set—revealing severe overfitting and underscoring the critical challenge of improving cross-sample spatiotemporal generalization. Our primary contributions are: (i) establishing the first end-to-end translation baseline for Swiss German Sign Language, and (ii) systematically validating the impact of joint spatiotemporal modeling on translation performance.
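The pipeline described above — per-frame spatial features from a CNN, temporal aggregation with an RNN, then a decoder producing target tokens — can be sketched in miniature. This is a pure-Python toy, not the paper's actual model: the single 2×2 filter, scalar RNN, three-word vocabulary, and all weights are illustrative stand-ins for the real learned components.

```python
import math
import random

random.seed(0)

def conv2d_valid(frame, kernel):
    """Single-channel 'valid' 2-D convolution with a tanh nonlinearity:
    a stand-in for the CNN that extracts spatial features per frame."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(frame), len(frame[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(frame[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(math.tanh(s))
        out.append(row)
    return out

def spatial_feature(frame, kernel):
    """Global average pooling over the conv map -> one scalar per frame."""
    fmap = conv2d_valid(frame, kernel)
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

def rnn_encode(features, w_in=1.2, w_rec=0.8):
    """Minimal tanh RNN over the frame features: the temporal step that
    models dependencies across the sign action."""
    h = 0.0
    for x in features:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def translate(video, kernel, out_weights, vocab):
    """End-to-end: frames -> spatial features -> temporal state -> token."""
    feats = [spatial_feature(f, kernel) for f in video]
    h = rnn_encode(feats)
    scores = [w * h for w in out_weights]
    return vocab[max(range(len(scores)), key=scores.__getitem__)]

# Toy 3-frame "video" of 4x4 grayscale frames (values illustrative).
video = [[[random.random() for _ in range(4)] for _ in range(4)]
         for _ in range(3)]
kernel = [[0.5, -0.5], [-0.5, 0.5]]   # an edge-like 2x2 filter
vocab = ["<unk>", "hallo", "danke"]
out_weights = [0.1, 0.9, -0.3]        # untrained decoder weights

print(translate(video, kernel, out_weights, vocab))
```

Because spatial and temporal parameters sit in one computation graph, a real version of this pipeline can be trained end-to-end with a single translation loss, which is the joint-optimization point the summary makes.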

📝 Abstract
This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of the word embeddings used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but the performance on the test set dropped to $0.11\pm0.06$ BLEU points.
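The "customized input embedding" idea from the abstract can be illustrated as a linear projection that maps a per-frame feature vector into the model dimension, playing the role word embeddings play in text MT. The sketch below is a hypothetical illustration: the 4-dimensional feature, 2-dimensional output, and all weight values are made up for demonstration.

```python
def embed_frame(feature_vec, weight, bias):
    """Project a per-frame feature vector into the model dimension.
    In text MT this slot is a word-embedding lookup; in SLT it is a
    learned projection of visual features (shapes here are illustrative)."""
    return [sum(w * x for w, x in zip(row, feature_vec)) + b
            for row, b in zip(weight, bias)]

# A 4-dim frame feature mapped to a 2-dim "embedding" (toy values).
feature = [0.1, 0.4, -0.2, 0.3]
weight = [[0.5, 0.1, 0.0, -0.2], [0.3, -0.4, 0.2, 0.1]]
bias = [0.0, 0.1]
print(embed_frame(feature, weight, bias))  # ~[0.03, -0.04]
```

A sequence of such embeddings, one per frame, is what the seq2seq encoder consumes in place of a sequence of word embeddings.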
Problem

Research questions and friction points this paper is trying to address.

Developing spatio-temporal sign language translation from video
Creating end-to-end architecture for better generalization
Addressing performance drop between development and test sets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns spatio-temporal features in single model
Uses end-to-end architecture for generalization
Extracts features from video frames directly