Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor generalization in cross-modal alignment between sign language videos and spoken-language subtitles, this paper proposes SEA, a general-purpose framework that decouples sign language segmentation from cross-modal alignment, eliminating the need for end-to-end training. SEA combines a pretrained sign language segmentation model with a multilingual text–sign cross-modal embedding model, coupled with a lightweight dynamic programming algorithm for efficient, transferable temporal alignment. To our knowledge, SEA is the first method to enable zero-shot cross-lingual and cross-domain alignment. It achieves state-of-the-art performance on four sign language datasets and processes one hour of video in under one minute on a single CPU core. The code and models are publicly released, providing a scalable, easily deployable solution for low-resource sign language understanding and parallel corpus construction.

📝 Abstract
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
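The alignment stage described in the abstract can be pictured as a monotonic dynamic program: given one embedding per segmented sign and one embedding per subtitle in a shared latent space, assign each sign to a subtitle so that assignments never move backward in time while maximizing total similarity. The sketch below is an illustrative reconstruction under that reading, not the paper's actual algorithm; `monotonic_align` and its interface are hypothetical.

```python
import numpy as np

def monotonic_align(seg_emb, sub_emb):
    """Assign each sign segment to one subtitle so that assignments are
    monotonically non-decreasing in time, maximizing summed cosine
    similarity. Illustrative sketch; not the paper's exact procedure."""
    # L2-normalize rows so the dot product equals cosine similarity
    S = seg_emb / np.linalg.norm(seg_emb, axis=1, keepdims=True)
    T = sub_emb / np.linalg.norm(sub_emb, axis=1, keepdims=True)
    sim = S @ T.T                       # (n_segments, n_subtitles)
    n, m = sim.shape

    dp = np.zeros((n, m))               # best score ending with segment i -> subtitle j
    back = np.zeros((n, m), dtype=int)  # predecessor subtitle, for backtracking
    dp[0] = sim[0]
    for i in range(1, n):
        best, arg = dp[i - 1, 0], 0     # running max over dp[i-1, :j+1]
        for j in range(m):
            if dp[i - 1, j] > best:
                best, arg = dp[i - 1, j], j
            dp[i, j] = sim[i, j] + best
            back[i, j] = arg

    # Backtrack from the best final subtitle
    j = int(np.argmax(dp[-1]))
    assignment = [j]
    for i in range(n - 1, 0, -1):
        j = int(back[i, j])
        assignment.append(j)
    return assignment[::-1]             # subtitle index per segment, in time order
```

Because the inner loop keeps a running maximum, the whole table fills in O(n·m) time, which is consistent with the paper's claim that hour-long episodes align within a minute on a single CPU core.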
Problem

Research questions and friction points this paper is trying to address.

Align subtitles to sign language videos universally
Overcome language-specific limitations in prior methods
Enable efficient cross-language and domain subtitle alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal framework for multi-language subtitle alignment
Leverages pretrained models for segmentation and embedding
Efficient CPU-based dynamic programming for fast alignment