Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual speech recognition (AVSR) methods rely predominantly on a single visual modality, typically lip movements, and overlook the potential of modeling multiple visual cues jointly. This work proposes LiPS-AVSR, the first AVSR framework to combine presentation slides as structured visual cues with lip motion for Chinese AVSR, together with Chinese-LiPS, a 100-hour Chinese audio-visual dataset jointly annotated with lip-reading video and the speaker's presentation slides. The pipeline comprises three core components: lip-feature extraction, slide text-visual representation alignment, and end-to-end audio-visual joint decoding. Experiments show that lip reading alone improves ASR performance by roughly 8%, slide information by roughly 25%, and their fusion by about 35% relative to an audio-only baseline, substantially outperforming either visual modality on its own. The work thus extends AVSR beyond lip reading toward richer, more semantically informative visual input.
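The three-component pipeline described above can be pictured as audio frames attending to lip features and to slide-text embeddings before decoding. The sketch below is a minimal illustration of that kind of fusion, not the authors' LiPS-AVSR implementation; the module names, feature dimensions, and cross-attention design are assumptions.

```python
# Minimal sketch of multimodal fusion for AVSR-style decoding (illustrative
# assumptions only; not the LiPS-AVSR architecture from the paper).
import torch
import torch.nn as nn


class MultimodalFusion(nn.Module):
    """Fuse audio frames with lip-motion features and slide-text embeddings."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Cross-attention: audio frames attend to lip features and slide text.
        self.lip_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.slide_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(3 * d_model, d_model)

    def forward(self, audio, lip, slide_text):
        # audio: (B, T_a, D), lip: (B, T_v, D), slide_text: (B, T_s, D)
        lip_ctx, _ = self.lip_attn(query=audio, key=lip, value=lip)
        slide_ctx, _ = self.slide_attn(query=audio, key=slide_text, value=slide_text)
        fused = torch.cat([audio, lip_ctx, slide_ctx], dim=-1)
        return self.out(fused)  # (B, T_a, D), passed on to an ASR decoder


if __name__ == "__main__":
    B, T_a, T_v, T_s, D = 2, 100, 50, 20, 256
    fusion = MultimodalFusion(d_model=D)
    out = fusion(torch.randn(B, T_a, D), torch.randn(B, T_v, D), torch.randn(B, T_s, D))
    print(out.shape)  # torch.Size([2, 100, 256])
```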

📝 Abstract
Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined performance improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/
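Note that the reported gains are relative improvements over the audio-only baseline rather than absolute error-rate differences. A short sketch with a purely hypothetical baseline error rate makes the arithmetic explicit:

```python
# Relative error-rate reduction, as typically reported for ASR/AVSR gains.
# The 10% audio-only baseline below is a hypothetical figure for illustration,
# not a number from the Chinese-LiPS paper.
def relative_reduction(baseline_err: float, improved_err: float) -> float:
    return (baseline_err - improved_err) / baseline_err

baseline   = 0.10                     # hypothetical audio-only error rate
with_lips  = baseline * (1 - 0.08)    # ~8% relative gain  -> 0.092
with_slide = baseline * (1 - 0.25)    # ~25% relative gain -> 0.075
with_both  = baseline * (1 - 0.35)    # ~35% relative gain -> 0.065

print(f"lips:  {relative_reduction(baseline, with_lips):.0%}")   # 8%
print(f"slide: {relative_reduction(baseline, with_slide):.0%}")  # 25%
print(f"both:  {relative_reduction(baseline, with_both):.0%}")   # 35%
```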
Problem

Research questions and friction points this paper is trying to address.

Combining lip-reading and slides for better speech recognition
Lack of datasets with multiple visual cues in AVSR
Improving Chinese ASR using multimodal visual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines lip-reading and presentation slides
Develops LiPS-AVSR multimodal pipeline
Improves ASR performance by about 35% when both visual cues are combined
Jinghua Zhao
Nankai University
Yuhang Jia
College of Computer Science, Nankai University, Tianjin, China
Shiyao Wang
College of Computer Science, Nankai University, Tianjin, China
Jiaming Zhou
College of Computer Science, Nankai University, Tianjin, China
Hui Wang
College of Computer Science, Nankai University, Tianjin, China
Yong Qin
College of Computer Science, Nankai University, Tianjin, China