🤖 AI Summary
This work describes the NAVER LABS Europe submission to the IWSLT 2025 Instruction-Following Speech Processing Short Track, covering English speech recognition (ASR), speech translation (ST) into Chinese, Italian, and German, and speech question answering (SQA). The system is a unified multitask framework: embeddings from the SeamlessM4T-v2-large speech encoder are projected into the embedding space of the Llama-3.1-8B-Instruct language model, which is adapted with LoRA. Both modules are jointly loaded and further instruction-tuned on multilingual, multimodal data, so that a single model handles all three tasks under the track's constrained condition without cascading separate systems. Evaluation on the IWSLT 2025 benchmark indicates that the approach generalizes across the three tasks and target languages.
📝 Abstract
In this paper, we describe the NAVER LABS Europe submission to the instruction-following speech processing short track at IWSLT 2025. We participate in the constrained setting, developing systems that simultaneously perform ASR, ST, and SQA from English speech input into three target languages: Chinese, Italian, and German. Our solution leverages two pretrained modules: (1) a speech-to-LLM embedding projector trained using representations from the SeamlessM4T-v2-large speech encoder; and (2) LoRA adapters trained on text data on top of Llama-3.1-8B-Instruct. These modules are jointly loaded and further instruction-tuned for 1K steps on multilingual and multimodal data to form our final system submitted for evaluation.
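The core architectural idea, projecting speech-encoder representations into the LLM's token embedding space, can be sketched as follows. This is a minimal illustrative sketch, not the submission's actual code: the projector architecture and the feature dimensions (1024 for the speech encoder, 4096 for the LLM) are assumptions, and the real system feeds the projected frames into Llama-3.1-8B-Instruct alongside text instruction tokens.

```python
import torch
import torch.nn as nn

# Assumed dimensions (not stated in the abstract): SeamlessM4T-v2-large
# speech-encoder frames of width 1024, Llama-3.1-8B-Instruct embeddings
# of width 4096.
SPEECH_DIM, LLM_DIM = 1024, 4096

class SpeechToLLMProjector(nn.Module):
    """Maps a sequence of speech-encoder frames into the LLM embedding
    space, so they can be concatenated with text token embeddings.
    The two-layer MLP here is a hypothetical choice for illustration."""

    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, n_frames, speech_dim)
        # returns:      (batch, n_frames, llm_dim)
        return self.proj(speech_feats)

projector = SpeechToLLMProjector(SPEECH_DIM, LLM_DIM)
frames = torch.randn(1, 50, SPEECH_DIM)  # dummy batch of 50 frames
llm_inputs = projector(frames)
print(llm_inputs.shape)  # torch.Size([1, 50, 4096])
```

In the described system, the projector is pretrained separately from the LoRA adapters (which are trained on text data), and both are then loaded together for the final joint instruction-tuning stage.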