🤖 AI Summary
To address two key bottlenecks in multilingual speech-to-text translation (S2TT) with multimodal large language models (MLLMs), namely narrow language coverage (an English-centric bias) and low inference efficiency (caused by computationally expensive long speech sequences), this paper proposes a lightweight, scalable many-to-many S2TT framework. The authors introduce a curriculum-learning-driven language expansion strategy, coupled with data balancing, to support 70 languages, and a lightweight speech adapter module that compresses speech representations into approximately 30 tokens, drastically improving inference speed. With only 10 hours of data per language and ~100M trainable parameters, the model outperforms existing end-to-end approaches across all 70×69 translation directions on the FLEURS benchmark, achieving higher BLEU scores and superior batch throughput. The code and models are publicly released.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT). However, current research is constrained by two key challenges: language coverage and efficiency. Most popular S2TT datasets are substantially English-centric, which restricts scaling up MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when speech is converted into long token sequences (e.g., 750 tokens). To address these limitations, we propose the Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language-scaling method that leverages curriculum learning and a data-balancing strategy extends the language coverage of MLLMs to 70 languages and enables mutual translation among them. Second, an optimized speech adapter module reduces the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70×69 directions but also improves batch inference efficiency, while using only ~100M trainable parameters and 10 hours of S2TT data per language. We release MCAT as open source to promote the development of MLLMs with robust S2TT capabilities. The code and models are available at https://github.com/yxduir/m2m-70.
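The core of the efficiency claim is length compression: collapsing a long speech feature sequence (e.g., 750 frames) into ~30 tokens before the LLM. The abstract does not describe the adapter's internals, so the following is only a minimal sketch of the length-reduction idea using simple segment-wise average pooling; the function name and pooling choice are illustrative assumptions, not the paper's design.

```python
import numpy as np

def compress_speech_features(feats: np.ndarray, target_len: int = 30) -> np.ndarray:
    """Illustrative length reduction (NOT the paper's actual adapter):
    average-pool a (T, d) speech feature sequence down to (target_len, d),
    e.g. 750 frames -> 30 tokens fed to the LLM."""
    T, _ = feats.shape
    # Split the time axis into target_len roughly equal segments.
    bounds = np.linspace(0, T, target_len + 1).astype(int)
    # Mean-pool each segment into a single "token" vector.
    return np.stack([feats[bounds[i]:bounds[i + 1]].mean(axis=0)
                     for i in range(target_len)])

feats = np.random.randn(750, 1024)      # hypothetical 750 speech frames
tokens = compress_speech_features(feats)
print(tokens.shape)                     # (30, 1024)
```

A 25x shorter sequence shrinks the quadratic attention cost over speech tokens accordingly, which is why batch inference throughput improves.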