🤖 AI Summary
This work addresses four limitations of current speech translation systems: scarce on-device resources, the privacy risks of cloud processing, bandwidth bottlenecks, and a pervasive English-centric bias that hinders many-to-many multilingual translation. The authors propose ESRT, an edge-cloud collaborative framework that keeps a lightweight speech encoder and adapter on the device and transmits only highly compressed intermediate features to the cloud for translation. Because raw audio never leaves the device, the design guards against speaker identity (voiceprint) leakage and cuts bandwidth requirements by up to 10×. To mitigate language bias, the approach combines multi-task weighted curriculum learning with data balancing. On the FLEURS benchmark, ESRT-4B and ESRT-12B achieve state-of-the-art many-to-many speech-to-text translation performance across 45 languages (45×44 translation directions).
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English-centric biases, which limits scaling to many-to-many translation. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device and transmits only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10$\times$. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages ($45 \times 44$ directions). Code and models are released at https://github.com/yxduir/esrt to facilitate reproducible, privacy-aware MLLM S2TT research.
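To make the scale of the many-to-many setting concrete, the short sketch below enumerates the ordered source-target pairs implied by $45 \times 44$ and counts how many of them do not involve English. The language identifiers are placeholders, and treating one of them as English is an assumption made purely for illustration; the snippet is not taken from the ESRT codebase.

```python
from itertools import permutations

NUM_LANGUAGES = 45  # number of languages stated in the abstract

# Placeholder language IDs (hypothetical); index 0 stands in for English.
languages = [f"lang_{i:02d}" for i in range(NUM_LANGUAGES)]
ENGLISH = languages[0]

# Every ordered (source, target) pair with source != target.
directions = list(permutations(languages, 2))

# Directions that have English as either the source or the target.
english_directions = [d for d in directions if ENGLISH in d]
non_english_directions = len(directions) - len(english_directions)

print(len(directions))          # 1980 = 45 * 44
print(len(english_directions))  # 88 = 2 * 44
print(non_english_directions)   # 1892
```

Under these assumptions, 1892 of the 1980 directions have neither an English source nor an English target, which is why an English-centric model covers only a small fraction of the many-to-many evaluation.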