MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation

📅 2025-08-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current gloss-free sign language translation (SLT) models struggle to accurately recognize high-speed fingerspelling and to integrate asynchronous facial cues such as lip movements, degrading translation quality for critical information like proper nouns. To address this, we propose a modular multi-expert parallel architecture that separately models continuous signing, fingerspelling, and lip-reading signals. Cross-modal coordination is achieved via a lightweight temporal-alignment Transformer, and the fused representation is passed to a large language model for translation generation. This design explicitly decouples complex multimodal couplings, significantly improving the fidelity of key linguistic elements. Our approach achieves a BLEU-4 score of 23.5 on the How2Sign dataset and a letter accuracy of 73.2% on the ChicagoFSWildPlus fingerspelling dataset, establishing new state-of-the-art results on both benchmarks.

📝 Abstract
Despite progress in gloss-free Sign Language Translation (SLT), monolithic end-to-end models consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in Automated Sign Language Translation with Large Language Models has sidestepped this challenge, forcing a single network to learn these tasks simultaneously, resulting in poor performance when translating crucial information such as names, places, and technical terms. We introduce MultiStream-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. These results validate our core hypothesis: by isolating and solving distinct recognition tasks before fusion, our multi-expert approach provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
Problem

Research questions and friction points this paper is trying to address.

Precise recognition of high-speed fingerspelling in sign language
Integration of asynchronous non-manual cues from facial expressions
Translating crucial information like names, places and technical terms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Separate specialized predictors for different sign modalities
Lightweight transformer fuses parallel token streams
Large Language Model generates final translated sentences
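The pipeline above (expert decoders → temporal fusion → LLM generation) can be sketched in miniature. This is an illustrative stand-in, not the paper's implementation: the expert networks are stubbed as fixed `(timestamp, token)` streams, the temporal-alignment transformer is replaced by a simple time-ordered merge, and `llm_generate` is a placeholder for the LLM decoder.

```python
def fuse_streams(*streams):
    """Merge (time, token) streams into one time-ordered token sequence.

    Stands in for the paper's lightweight temporal-alignment transformer:
    tokens from signing, fingerspelling, and lipreading are interleaved
    by timestamp so the downstream model sees a coherent sequence.
    """
    merged = sorted((t, tok) for stream in streams for t, tok in stream)
    return [tok for _, tok in merged]


def llm_generate(tokens):
    """Placeholder for the LLM: real system conditions generation on the
    fused representation; here we just join tokens for illustration."""
    return " ".join(tokens)


# Hypothetical expert outputs for the signed sentence "My name is J-O-H-N".
signing = [(0.0, "MY"), (0.5, "NAME")]          # continuous-signing expert
fingerspelling = [(1.0, "J-O-H-N")]             # fingerspelling expert
lipreading = [(0.9, "john")]                    # lipreading expert

tokens = fuse_streams(signing, fingerspelling, lipreading)
print(llm_generate(tokens))  # → MY NAME john J-O-H-N
```

The point of the sketch is the decoupling: each modality is decoded independently before fusion, so a fast fingerspelled name arrives as an already-recognized token rather than raw frames the translation model must disentangle.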