🤖 AI Summary
Existing encoder-decoder models adapt poorly to new tasks and rely heavily on costly tri-lingual parallel data (speech, source text, target text) for speech-to-text and multimodal machine translation (MMT). To address this, we propose a unified multimodal translation framework built on the Whisper architecture. Our approach introduces lightweight adapters for efficient cross-task and cross-modal transfer; a multimodal conditional input mechanism that jointly processes speech and source-language text, using either ASR hypotheses or ground-truth transcripts as prompts; and a two-stage decoding strategy that improves robustness in speech translation. Crucially, the method enables end-to-end cross-modal fine-tuning without requiring tri-lingual parallel corpora. Empirical evaluation on MMT benchmarks shows substantial gains of +2.1 BLEU and +3.4 COMET, demonstrating the framework's flexibility, parameter efficiency, and strong generalization across modalities and tasks.
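The "lightweight adapters" mentioned above are, in the usual formulation, small residual bottleneck modules inserted into a frozen backbone. The summary does not specify the adapter design, so the following is a minimal sketch of a standard bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection); the dimensions and zero-initialized up-projection are illustrative conventions, not details from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckAdapter:
    """Residual bottleneck adapter: h + W_up(relu(W_down(h))).

    d_model: hidden size of the frozen backbone layer it wraps.
    r:       bottleneck width (r << d_model), the only trainable capacity.
    """
    def __init__(self, d_model, r, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(0.0, 0.02, size=(d_model, r))
        # Zero-init the up-projection so the adapter starts as an identity
        # map and fine-tuning perturbs the frozen model gradually.
        self.W_up = np.zeros((r, d_model))

    def __call__(self, h):
        return h + relu(h @ self.W_down) @ self.W_up

# Toy usage: a batch of 3 hidden states of width 8, bottleneck width 2.
adapter = BottleneckAdapter(d_model=8, r=2)
h = np.ones((3, 8))
out = adapter(h)  # same shape as the input; identity at initialization
```

Only `W_down` and `W_up` would be updated during fine-tuning, which is what makes this kind of adaptation parameter-efficient relative to full fine-tuning of the backbone.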
📝 Abstract
Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source-language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also enhances speech translation (ST) performance through a 2-stage decoding strategy. We demonstrate our methods using the Whisper model, though in principle they are general and could be applied to similar multitask models. We highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring 3-way parallel data. Our results underscore the flexibility, efficiency, and general applicability of the proposed framework for multi-modal translation.
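The 2-stage decoding strategy described above can be sketched as a simple pipeline: a first pass transcribes the speech, and a second pass translates while conditioning on both the audio and the first-pass hypothesis supplied as a prompt. The function names below (`asr_decode`, `mmt_decode`) are hypothetical stand-ins for the model's transcription and translation passes, not APIs from the paper's code.

```python
def two_stage_translate(audio, asr_decode, mmt_decode):
    """Two-stage decoding sketch.

    Stage 1: run an ASR pass to obtain a source-language hypothesis.
    Stage 2: run the translation pass conditioned on both the audio and
             the hypothesis, passed in as a text prompt.
    """
    hypothesis = asr_decode(audio)          # stage 1: speech -> source text
    return mmt_decode(audio, prompt=hypothesis)  # stage 2: speech + text -> target text

# Toy stand-ins, just to show the wiring; a real system would call the
# fine-tuned Whisper model in both stages.
fake_asr = lambda audio: "hallo welt"
fake_mmt = lambda audio, prompt: f"translated(audio | prompt={prompt!r})"

result = two_stage_translate(audio=None, asr_decode=fake_asr, mmt_decode=fake_mmt)
```

At training time, the same prompt slot can instead be filled with a ground-truth transcript, which is how the framework can consume bilingual (speech, target) or (source text, target) pairs without requiring 3-way parallel data.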