Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

📅 2025-12-08

🤖 AI Summary

To address poor generalization in cross-modal alignment between sign language videos and spoken-language subtitles, this paper proposes SEA (Segment, Embed, and Align), a general-purpose framework that decouples sign segmentation from cross-modal alignment and so removes the need for end-to-end training tied to one language or dataset. SEA combines a pretrained sign segmentation model with a pretrained multilingual model that embeds sign clips and text in a shared latent space, then aligns the two with a lightweight dynamic programming procedure. SEA is presented as the first method to support zero-shot cross-lingual and cross-domain alignment. It achieves state-of-the-art alignment performance on four sign language datasets and aligns hour-long episodes within a minute on CPU. The code and models are publicly released, offering a scalable, easily deployable way to construct parallel corpora and support low-resource sign language processing.

📝 Abstract

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
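The three stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the segmentation and embedding models have already produced time-ordered sign segments and embedding vectors, and it uses a plain monotonic dynamic program that maximizes summed cosine similarity. The function names (`cosine`, `align_segments`), the toy 2-D embeddings, and the scoring objective are all hypothetical, and the released SEA code may use a different cost (for example, one that also takes the original subtitle timestamps into account).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_segments(segments, seg_embs, sub_embs):
    """Assign each sign segment to one subtitle, keeping temporal order.

    segments: list of (start, end) times, sorted by time (assumed segmenter output).
    seg_embs: one embedding per segment (assumed embedder output).
    sub_embs: one embedding per subtitle, in subtitle order.
    Returns (labels, spans): the subtitle index per segment, and the
    (start, end) span per subtitle (None if no segment was assigned to it).
    """
    n, m = len(segments), len(sub_embs)
    if n == 0 or m == 0:
        return [], [None] * m

    sim = [[cosine(seg_embs[i], sub_embs[j]) for j in range(m)] for i in range(n)]

    # score[i][j]: best total similarity for segments 0..i with segment i -> subtitle j.
    score = [[0.0] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    score[0] = sim[0][:]
    for i in range(1, n):
        best_k = 0  # argmax of score[i-1][k] over k <= j (running prefix maximum)
        for j in range(m):
            if score[i - 1][j] > score[i - 1][best_k]:
                best_k = j
            score[i][j] = sim[i][j] + score[i - 1][best_k]
            back[i][j] = best_k

    # Backtrack from the best final state to recover the monotonic assignment.
    j = max(range(m), key=lambda c: score[n - 1][c])
    labels = [0] * n
    for i in range(n - 1, -1, -1):
        labels[i] = j
        j = back[i][j]

    # Each subtitle's new timing spans the segments assigned to it.
    spans = [None] * m
    for (start, end), j in zip(segments, labels):
        if spans[j] is None:
            spans[j] = (start, end)
        else:
            spans[j] = (min(spans[j][0], start), max(spans[j][1], end))
    return labels, spans


# Toy example: four sign segments, two subtitles, 2-D embeddings.
segments = [(0.0, 0.5), (0.6, 1.0), (1.2, 1.8), (2.0, 2.4)]
seg_embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
sub_embs = [[1.0, 0.0], [0.0, 1.0]]
labels, spans = align_segments(segments, seg_embs, sub_embs)
print(labels)  # labels -> [0, 0, 1, 1]
print(spans)
```

The table has one cell per (segment, subtitle) pair and the inner loop carries a running prefix maximum, so the cost is proportional to the number of segments times the number of subtitles. That is consistent with the abstract's point that a lightweight dynamic program on CPU is enough once the two pretrained models have done the heavy lifting.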

Problem

Research questions and friction points this paper is trying to address.

Align subtitles to sign language videos universally

Overcome language-specific limitations in prior methods

Enable efficient cross-language and cross-domain subtitle alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal framework for multi-language subtitle alignment

Leverages pretrained models for segmentation and embedding

Efficient CPU-based dynamic programming for fast alignment
