🤖 AI Summary
This work proposes Donut-Whisper, a multimodal model for English–Chinese speech recognition that improves performance by fusing audio inputs with visual text such as on-screen subtitles. The architecture employs dual encoders, Whisper for audio and Donut for visual features, combined through linear and Q-Former-based modality alignment mechanisms followed by cross-attention layers for effective cross-modal feature fusion. A lightweight knowledge distillation strategy is also introduced, in which the multimodal model guides the training of an audio-only model. To support this research, the authors construct the first bilingual (English–Chinese) audio-visual speech recognition dataset derived from movie clips. Experiments show significant improvements over the Whisper Large V3 baseline: an absolute 5.75% reduction in English word error rate (WER) and an absolute 16.5% reduction in Chinese character error rate (CER).
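The fusion step described above can be illustrated with a minimal numpy sketch: audio features act as queries that attend over visual features, and the attended visual context is added back onto the audio stream. This is a single-head, projection-free toy, not the paper's implementation; the feature dimensions and the residual fusion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Queries (audio) attend over keys/values (visual).

    Single head, no learned projections -- a toy stand-in for the
    cross-attention fusion module described in the summary.
    """
    d_k = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (T_audio, T_visual)
    return softmax(scores) @ keys_values              # (T_audio, d)

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((50, 64))    # e.g. audio-encoder outputs
visual_feats = rng.standard_normal((10, 64))   # e.g. aligned visual features
fused = audio_feats + cross_attention(audio_feats, visual_feats)  # residual fusion
```

In practice the queries, keys, and values would pass through learned projections and multiple heads, but the shape bookkeeping is the same: the fused sequence keeps the audio length while mixing in visual context.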
📝 Abstract
Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, a dual-encoder audio-visual ASR model that leverages visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantages of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. Meanwhile, we propose a lightweight knowledge distillation scheme that showcases the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we introduce a new multilingual audio-visual speech recognition dataset based on movie clips, containing both Chinese and English partitions. Donut-Whisper achieves significantly better performance on both the English and Chinese partitions of the dataset than the Donut and Whisper Large V3 baselines. In particular, an absolute 5.75% WER reduction and an absolute 16.5% CER reduction were achieved on the English and Chinese sets respectively, compared to the Whisper ASR baseline.
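The knowledge distillation scheme, where an audio-visual teacher guides an audio-only student, is commonly realized as a temperature-scaled KL divergence between the two models' output distributions. The sketch below shows that generic objective in numpy; the paper's exact loss formulation, temperature, and weighting are not specified here, so treat all of them as assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) with temperature T, averaged over positions.

    A generic knowledge-distillation objective (Hinton et al. style);
    the paper's actual loss may differ. Logits have shape (positions, vocab).
    """
    log_p_t = log_softmax(teacher_logits / T)
    log_p_s = log_softmax(student_logits / T)
    p_t = np.exp(log_p_t)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
    return float((p_t * (log_p_t - log_p_s)).sum(axis=-1).mean() * T * T)
```

In training, this term would typically be combined with the student's usual ASR loss on ground-truth transcripts, so the audio-only model learns both from labels and from the audio-visual teacher's softened predictions.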