🤖 AI Summary
Existing Automatic Speech Recognition (ASR), Sign Language Translation (SLT), and Visual Speech Recognition (VSR) systems model speech, sign language, and lip movements in isolation, lacking a unified multimodal framework that addresses the communication needs of deaf and hard-of-hearing individuals.
Method: We propose the first inclusive speech recognition framework that integrates sign language, lip motion, and audio. Built on a large language model backbone, it employs a modality-agnostic encoder-decoder architecture with cross-modal alignment to unify heterogeneous inputs. Crucially, we introduce a disentangled lip-motion modeling module, which reveals lip articulation to be a critical non-manual cue in sign language.
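The summary gives no implementation details, but a minimal PyTorch-style sketch may help make the design concrete: per-modality encoders project heterogeneous inputs into one shared embedding space, and a shared alignment layer fuses them before an LLM decoder generates text. All module names, feature dimensions, and the choice of self-attention as the alignment mechanism are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ModalityAgnosticFrontend(nn.Module):
    """Illustrative sketch (not the paper's code): any subset of modalities
    is encoded into a shared space and fused before an LLM decoder."""

    def __init__(self, d_model: int = 1024):
        super().__init__()
        # Per-modality projections; input feature dimensions are assumptions.
        self.encoders = nn.ModuleDict({
            "audio": nn.Linear(80, d_model),   # e.g., log-mel frames
            "sign": nn.Linear(512, d_model),   # e.g., sign-video features
            "lip": nn.Linear(256, d_model),    # disentangled lip-motion features
        })
        # Cross-modal alignment modeled here as shared self-attention
        # over the concatenated token streams.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.align = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Accept whichever modalities are present; concatenate along time.
        tokens = [self.encoders[name](feats) for name, feats in inputs.items()]
        fused = torch.cat(tokens, dim=1)  # (batch, total_time, d_model)
        return self.align(fused)          # handed to the LLM backbone
```

A frozen or fine-tuned LLM would then consume these fused embeddings as prefix tokens and decode spoken-language text.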
Contribution/Results: The framework jointly optimizes SLT, VSR, and ASR end-to-end. It matches or surpasses state-of-the-art unimodal models on each task, demonstrating for the first time the synergistic benefits and generalization capability of unified multimodal modeling across all three modalities.
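The joint objective is not spelled out in this summary; a standard multitask formulation (an assumption on our part, not the paper's stated loss) would weight per-task sequence losses, e.g. $\mathcal{L} = \lambda_{\text{SLT}}\,\mathcal{L}_{\text{SLT}} + \lambda_{\text{VSR}}\,\mathcal{L}_{\text{VSR}} + \lambda_{\text{ASR}}\,\mathcal{L}_{\text{ASR}}$, where each term is the token-level cross-entropy of the generated text against its reference.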
📝 Abstract
Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such systems remain inherently inaccessible to individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved communication without audio. Yet these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture that effectively processes heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) matching or surpassing the performance of state-of-the-art models specialized for individual tasks. Building on this framework, we match or exceed task-specific state-of-the-art results across SLT, VSR, ASR, and Audio-Visual Speech Recognition (AVSR). Furthermore, our analysis reveals that explicitly modeling lip movements as a separate modality significantly improves SLT performance.
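To illustrate that last finding, one plausible way to treat lip movements as an explicit, separate stream (rather than leaving them implicit in the full sign video) is to encode them independently and let the sign features attend to them. The class and argument names below are hypothetical, not the paper's API.

```python
import torch
import torch.nn as nn

class LipAwareSignFusion(nn.Module):
    """Hypothetical sketch: fuse a dedicated lip-motion stream into sign
    features via cross-attention, so non-manual lip cues are modeled
    explicitly instead of being buried in the full-frame sign features."""

    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, sign_feats: torch.Tensor,
                lip_feats: torch.Tensor) -> torch.Tensor:
        # sign_feats, lip_feats: (batch, time, d_model)
        # Sign tokens query the lip stream for complementary non-manual cues.
        attended, _ = self.cross_attn(sign_feats, lip_feats, lip_feats)
        return self.norm(sign_feats + attended)  # residual fusion
```

Under a design like this, ablating the lip stream would directly test the claim that explicit lip modeling improves SLT.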