🤖 AI Summary
This work proposes Donut-Whisper, a multimodal model for English–Chinese speech recognition that improves performance by fusing audio inputs with visual text such as on-screen subtitles. The architecture employs dual encoders, Whisper for audio and Donut for visual features, combined through linear and Q-Former-based modality alignment mechanisms followed by cross-attention layers for effective cross-modal feature fusion. A lightweight knowledge distillation strategy is also introduced, in which the multimodal model guides the training of an audio-only model. To support this research, the authors construct the first bilingual (English–Chinese) audio-visual speech recognition dataset derived from movie clips. Experiments show significant improvements over the Whisper Large V3 baseline: an absolute 5.75% reduction in English word error rate (WER) and an absolute 16.5% reduction in Chinese character error rate (CER).
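The fusion step described above can be illustrated with a minimal numpy sketch: audio features act as queries that attend over visual features, and the attended visual context is added back onto the audio stream. This is a single-head, projection-free toy, not the paper's implementation; the feature dimensions and the residual fusion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Queries (audio) attend over keys/values (visual).

    Single head, no learned projections -- a toy stand-in for the
    cross-attention fusion module described in the summary.
    """
    d_k = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (T_audio, T_visual)
    return softmax(scores) @ keys_values              # (T_audio, d)

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((50, 64))    # e.g. audio-encoder outputs
visual_feats = rng.standard_normal((10, 64))   # e.g. aligned visual features
fused = audio_feats + cross_attention(audio_feats, visual_feats)  # residual fusion
```

In practice the queries, keys, and values would pass through learned projections and multiple heads, but the shape bookkeeping is the same: the fused sequence keeps the audio length while mixing in visual context.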
📝 Abstract
Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, a dual-encoder audio-visual ASR model that leverages visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantages of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. Meanwhile, we propose a lightweight knowledge distillation scheme that showcases the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we introduce a new multilingual audio-visual speech recognition dataset based on movie clips, containing both Chinese and English partitions. Donut-Whisper achieves significantly better performance on both the English and Chinese partitions of the dataset than the Donut and Whisper Large V3 baselines. In particular, an absolute 5.75% WER reduction and an absolute 16.5% CER reduction were achieved on the English and Chinese sets respectively, compared to the Whisper ASR baseline.
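The knowledge distillation scheme, where an audio-visual teacher guides an audio-only student, is commonly realized as a temperature-scaled KL divergence between the two models' output distributions. The sketch below shows that generic objective in numpy; the paper's exact loss formulation, temperature, and weighting are not specified here, so treat all of them as assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) with temperature T, averaged over positions.

    A generic knowledge-distillation objective (Hinton et al. style);
    the paper's actual loss may differ. Logits have shape (positions, vocab).
    """
    log_p_t = log_softmax(teacher_logits / T)
    log_p_s = log_softmax(student_logits / T)
    p_t = np.exp(log_p_t)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
    return float((p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * T * T)
```

In training, this term would typically be combined with the student's usual ASR loss on ground-truth transcripts, so the audio-only model learns both from labels and from the audio-visual teacher's softened predictions.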