🤖 AI Summary
To address poor generalization in cross-modal alignment between sign language videos and spoken-language subtitles, this paper proposes SEA (Segment, Embed, and Align), a general-purpose framework that decouples sign segmentation from cross-modal alignment and so avoids end-to-end training tied to a single language or dataset. SEA combines a pre-trained sign language segmentation model with a multilingual text–sign cross-modal embedding model, then applies a lightweight dynamic programming algorithm for efficient and transferable temporal alignment. To our knowledge, SEA is the first method enabling zero-shot cross-lingual and cross-domain alignment. It achieves state-of-the-art performance on four sign language datasets and aligns hour-long videos within a minute on CPU. The code and models are publicly released, providing a scalable, easily deployable solution for low-resource sign language understanding and parallel corpus construction.
📝 Abstract
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method, Segment, Embed, and Align (SEA), provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first segments a video frame sequence into individual signs, and the second embeds the video clip of each sign into a latent space shared with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
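The alignment step can be illustrated with a minimal sketch. This is not the released SEA implementation: it assumes the segmentation and embedding models have already produced a similarity matrix between sign segments and subtitles, and it uses a simple monotonic dynamic program in which each sign is assigned to exactly one subtitle and subtitle indices never decrease over time. The function names `align_monotonic` and `subtitle_spans`, the toy similarity values, and the segment timestamps are all hypothetical.

```python
from typing import List, Optional, Tuple

import numpy as np


def align_monotonic(sim: np.ndarray) -> List[int]:
    """Assign each sign (row) to one subtitle (column) in non-decreasing order.

    sim[i, j] is an assumed similarity between sign i and subtitle j.
    The returned path maximizes the summed similarity under that constraint.
    """
    n_signs, n_subs = sim.shape
    score = np.full((n_signs, n_subs), -np.inf)
    back = np.zeros((n_signs, n_subs), dtype=int)
    score[0] = sim[0]  # the first sign may map to any subtitle
    for i in range(1, n_signs):
        best_k = 0  # index of the best previous score among subtitles 0..j
        for j in range(n_subs):
            if score[i - 1, j] > score[i - 1, best_k]:
                best_k = j
            score[i, j] = sim[i, j] + score[i - 1, best_k]
            back[i, j] = best_k
    # Trace back from the best final state.
    j = int(np.argmax(score[-1]))
    path = [j]
    for i in range(n_signs - 1, 0, -1):
        j = int(back[i, j])
        path.append(j)
    return path[::-1]


def subtitle_spans(
    path: List[int], segments: List[Tuple[float, float]], n_subs: int
) -> List[Optional[Tuple[float, float]]]:
    """Turn a sign-to-subtitle path into one (start, end) span per subtitle.

    Subtitles that received no sign keep the value None.
    """
    spans: List[Optional[Tuple[float, float]]] = [None] * n_subs
    for (start, end), j in zip(segments, path):
        spans[j] = (start, end) if spans[j] is None else (spans[j][0], end)
    return spans


# Toy input: 4 sign segments, 2 subtitles (values are made up).
sim = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]])
segments = [(0.0, 0.5), (0.6, 1.0), (1.2, 1.8), (1.9, 2.4)]  # seconds

path = align_monotonic(sim)
spans = subtitle_spans(path, segments, n_subs=sim.shape[1])
print(path)   # [0, 0, 1, 1]
print(spans)  # [(0.0, 1.0), (1.2, 2.4)]
```

The running maximum keeps the cost at O(signs × subtitles), which is in keeping with the abstract's claim that alignment is lightweight enough to run on CPUs. A practical system would likely add constraints this sketch omits, such as priors from the original subtitle timestamps.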