🤖 AI Summary
This work describes the NAVER LABS Europe submission to the IWSLT 2025 Instruction-Following Speech Processing Short Track, covering English speech recognition (ASR), speech translation (ST) into Chinese, Italian, and German, and speech question answering (SQA). The system is a unified multitask framework: embeddings from the SeamlessM4T-v2-large speech encoder are projected into the embedding space of the Llama-3.1-8B-Instruct language model, which is adapted with LoRA. Both modules are jointly loaded and further instruction-tuned on multilingual, multimodal data, so that a single model handles all three tasks under the track's constrained condition without cascading separate systems. Evaluation on the IWSLT 2025 benchmark indicates that the approach generalizes across the three tasks and target languages.
📝 Abstract
In this paper, we describe the NAVER LABS Europe submission to the instruction-following speech processing short track at IWSLT 2025. We participate in the constrained setting, developing systems that simultaneously perform ASR, ST, and SQA from English speech input into three target languages: Chinese, Italian, and German. Our solution leverages two pretrained modules: (1) a speech-to-LLM embedding projector trained using representations from the SeamlessM4T-v2-large speech encoder; and (2) LoRA adapters trained on text data on top of Llama-3.1-8B-Instruct. These modules are jointly loaded and further instruction-tuned for 1K steps on multilingual and multimodal data to form our final system submitted for evaluation.
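The core architectural idea, projecting speech-encoder representations into the LLM's token embedding space, can be sketched as follows. This is a minimal illustrative sketch, not the submission's actual code: the projector architecture and the feature dimensions (1024 for the speech encoder, 4096 for the LLM) are assumptions, and the real system feeds the projected frames into Llama-3.1-8B-Instruct alongside text instruction tokens.

```python
import torch
import torch.nn as nn

# Assumed dimensions (not stated in the abstract): SeamlessM4T-v2-large
# speech-encoder frames of width 1024, Llama-3.1-8B-Instruct embeddings
# of width 4096.
SPEECH_DIM, LLM_DIM = 1024, 4096

class SpeechToLLMProjector(nn.Module):
    """Maps a sequence of speech-encoder frames into the LLM embedding
    space, so they can be concatenated with text token embeddings.
    The two-layer MLP here is a hypothetical choice for illustration."""

    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, n_frames, speech_dim)
        # returns:      (batch, n_frames, llm_dim)
        return self.proj(speech_feats)

projector = SpeechToLLMProjector(SPEECH_DIM, LLM_DIM)
frames = torch.randn(1, 50, SPEECH_DIM)  # dummy batch of 50 frames
llm_inputs = projector(frames)
print(llm_inputs.shape)  # torch.Size([1, 50, 4096])
```

In the described system, the projector is pretrained separately from the LoRA adapters (which are trained on text data), and both are then loaded together for the final joint instruction-tuning stage.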