🤖 AI Summary
This study addresses the end-to-end translation of Swiss German Sign Language (video) to German (text), tackling poor generalization stemming from insufficient spatiotemporal feature modeling. We propose a unified video-to-text sequence-to-sequence framework that jointly extracts frame-level spatiotemporal dynamics using CNNs and RNNs, and introduces a customized input embedding for joint optimization. Unlike cascaded pipeline approaches, our architecture performs end-to-end learning of sign visual representations and linguistic generation, explicitly modeling temporal action dependencies. Experiments yield 5.0±1.0 BLEU on the development set but only 0.11±0.06 BLEU on the test set—revealing severe overfitting and underscoring the critical challenge of improving cross-sample spatiotemporal generalization. Our primary contributions are: (i) establishing the first end-to-end translation baseline for Swiss German Sign Language, and (ii) systematically validating the impact of joint spatiotemporal modeling on translation performance.
📝 Abstract
This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of the word embeddings used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a truly end-to-end architecture that is expected to generalize better to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but performance on the test set dropped to $0.11\pm0.06$ BLEU points.
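The architecture described above can be sketched as a single trainable module: a per-frame CNN extracts spatial features, an RNN encoder models temporal dynamics over the frame sequence, and an RNN decoder generates German tokens. This is not the authors' code; it is a minimal PyTorch illustration, and all layer sizes and names are illustrative assumptions.

```python
# Hedged sketch of an end-to-end video-to-text seq2seq model:
# per-frame CNN -> RNN encoder -> RNN decoder over target tokens.
# Hyperparameters are illustrative, not the submission's actual values.
import torch
import torch.nn as nn

class SignTranslator(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=256, hidden=512):
        super().__init__()
        # Spatial features per frame (tiny CNN stand-in).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Temporal modeling over the frame sequence.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, tokens):
        # frames: (B, T, 3, H, W) video clips; tokens: (B, L) target prefix.
        b, t = frames.shape[:2]
        # Run the CNN on every frame, then restore the time dimension.
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, h = self.encoder(feats)             # encoder summary state
        dec, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec)                   # (B, L, vocab) logits

model = SignTranslator()
frames = torch.randn(2, 8, 3, 64, 64)    # 2 clips, 8 frames each
tokens = torch.zeros(2, 5, dtype=torch.long)
logits = model(frames, tokens)
print(logits.shape)  # torch.Size([2, 5, 1000])
```

Because the CNN, encoder, and decoder sit in one module, the translation loss backpropagates through the visual feature extractor, which is what distinguishes this end-to-end setup from cascaded pipelines with frozen, pre-extracted features.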