🤖 AI Summary
Lip-reading models suffer significant performance degradation in cross-speaker scenarios, primarily because speaker-specific characteristics, such as idiosyncratic lip articulation patterns and lexical preferences, are not modeled, and because existing datasets are limited in scale, recording conditions, and vocabulary. To address this, we propose a personalization framework operating at both the visual and linguistic levels: (1) the first linguistic-level speaker modeling, capturing individual lexical choice preferences; (2) VoxLRS-SA, the first large-scale, in-the-wild, sentence-level lip-reading adaptation dataset with a roughly 100K-word vocabulary; and (3) a lightweight dual-level (vision and language) adaptation strategy integrating prompt tuning and LoRA. Experiments on VoxLRS-SA demonstrate substantial improvements over strong baselines, establishing for the first time the effectiveness and generalizability of speaker adaptation for real-world, sentence-level lip reading.
📝 Abstract
Lip reading aims to predict spoken language by analyzing lip movements. Despite advancements in lip reading technologies, performance degrades when models are applied to unseen speakers due to their sensitivity to variations in visual information such as lip appearance. To address this challenge, speaker-adaptive lip reading methods have been developed, focusing on effectively adapting a lip reading model to target speakers in the visual modality. However, the effectiveness of adapting to a target speaker's language information, such as vocabulary choice, has not been explored in previous works. Additionally, existing datasets for speaker adaptation have limited vocabulary sizes and pose variations, which restricts the validation of previous speaker-adaptive methods in real-world scenarios. To address these issues, we propose a novel speaker-adaptive lip reading method that adapts a pre-trained model to target speakers at both the vision and language levels. Specifically, we integrate prompt tuning and the LoRA approach, applying them to a pre-trained lip reading model to effectively adapt it to target speakers. Furthermore, to validate its effectiveness in real-world scenarios, we introduce a new dataset, VoxLRS-SA, derived from VoxCeleb2 and LRS3. It contains a vocabulary of approximately 100K words, offers diverse pose variations, and enables, for the first time in English, the validation of adaptation methods on in-the-wild, sentence-level lip reading. Through various experiments, we demonstrate that existing speaker-adaptive methods also improve performance in the wild at the sentence level. Moreover, we show that the proposed method achieves larger improvements than previous works.
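The core idea of the adaptation strategy, combining learnable prompt tokens with low-rank (LoRA) updates on a frozen pre-trained model, can be illustrated with a minimal PyTorch sketch. This is an assumption-laden toy, not the paper's actual architecture: the class names (`LoRALinear`, `SpeakerAdaptedEncoder`), the single linear "encoder" layer, and all dimensions are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        # Low-rank factors: A is small-random, B is zero, so the wrapped layer
        # initially behaves exactly like the frozen base layer.
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

class SpeakerAdaptedEncoder(nn.Module):
    """Toy frozen encoder plus speaker-specific prompts and LoRA adapters."""
    def __init__(self, dim: int = 64, n_prompts: int = 4):
        super().__init__()
        # Stand-in for one layer of a pre-trained visual encoder (frozen + LoRA).
        self.proj = LoRALinear(nn.Linear(dim, dim))
        # Learnable per-speaker prompt tokens, prepended to the input sequence.
        self.prompts = nn.Parameter(torch.zeros(n_prompts, dim))

    def forward(self, feats):              # feats: (batch, time, dim)
        b = feats.size(0)
        p = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([p, feats], dim=1)   # prompts extend the sequence
        return self.proj(x)

model = SpeakerAdaptedEncoder()
out = model(torch.randn(2, 10, 64))
print(out.shape)                           # torch.Size([2, 14, 64])

# Only prompts and LoRA factors are trainable; the base layer stays frozen,
# which is what keeps per-speaker adaptation lightweight.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable, frozen)
```

During adaptation, only `trainable` parameters (here 768: two rank-4 factors plus four prompt vectors) would be updated per target speaker, while the 4,160 frozen base parameters are shared across all speakers.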