🤖 AI Summary
Current annotation-free sign language translation (SLT) models struggle to accurately recognize high-speed fingerspelling and to integrate asynchronous facial cues such as lip movements, degrading translation quality for critical information like proper nouns. To address this, we propose a modular multi-expert parallel architecture that separately models continuous signing, fingerspelling, and lip-reading signals. Cross-modal coordination is achieved via a lightweight temporal-alignment Transformer, followed by translation generation with a large language model. This design explicitly decouples the complex multimodal couplings, significantly improving the fidelity of key linguistic elements. Our approach achieves a BLEU-4 score of 23.5 on the How2Sign dataset and a letter accuracy of 73.2% on the ChicagoFSWildPlus fingerspelling dataset, establishing new state-of-the-art results on both benchmarks.
📝 Abstract
Despite progress in gloss-free Sign Language Translation (SLT), monolithic end-to-end models consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent work on automated sign language translation with Large Language Models has sidestepped this challenge, forcing a single network to learn these tasks simultaneously and resulting in poor performance when translating crucial information such as names, places, and technical terms. We introduce MultiStream-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state of the art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. These results validate our core hypothesis: by isolating and solving distinct recognition tasks before fusion, our multi-expert approach provides a more powerful and effective pathway to robust, high-fidelity sign language translation.
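The pipeline the abstract describes — parallel expert decoders emitting token streams, a fusion step that resolves temporal misalignment, and an LLM producing the final sentence — can be sketched in a few lines. This is a minimal toy illustration, not the paper's actual implementation: the `Token` structure, the sort-by-timestamp fusion, and the string-joining stand-in for the LLM are all hypothetical simplifications (the paper uses a lightweight transformer for fusion and a real LLM for generation).

```python
# Toy sketch of the MultiStream-LLM data flow (hypothetical names, not the
# paper's code): each modality expert decodes its input into timestamped
# tokens; fusion merges the parallel streams in temporal order; a stand-in
# "LLM" then produces the output sentence from the merged tokens.

from dataclasses import dataclass


@dataclass
class Token:
    text: str    # decoded unit (gloss, fingerspelled word, lip-read cue)
    time: float  # timestamp in seconds, used to resolve misalignment
    stream: str  # which expert produced it


def fuse_streams(*streams):
    """Merge parallel token streams into one temporally ordered sequence.

    In the paper this role is played by a lightweight transformer; here a
    simple sort by timestamp illustrates the alignment idea.
    """
    merged = [tok for stream in streams for tok in stream]
    return sorted(merged, key=lambda t: t.time)


def generate_sentence(tokens):
    """Stand-in for the LLM decoder: simply concatenates token texts."""
    return " ".join(t.text for t in tokens)


# Hypothetical expert outputs for one clip: a name is fingerspelled in the
# middle of a continuously signed sentence.
signing = [Token("MY", 0.1, "sign"), Token("NAME", 0.5, "sign")]
fingerspelling = [Token("A-N-N-A", 1.0, "fingerspelling")]
lips = []  # the lip stream may be empty for a given clip

fused = fuse_streams(signing, fingerspelling, lips)
print(generate_sentence(fused))  # -> "MY NAME A-N-N-A"
```

Keeping the fingerspelling tokens as a distinct stream until fusion is what lets the sketch (and, per the abstract, the real system) preserve proper nouns that a monolithic model tends to garble.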