KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient semantic understanding and generation capabilities in the IWSLT 2025 offline speech translation (ST) and instruction-following tasks, this work proposes an LLM-augmented multi-stage speech-to-text translation framework. Methodologically, it introduces document-level context-aware ASR output ensembling coupled with a “two-step translation + refinement” paradigm; additionally, it designs the first unified end-to-end instruction-following architecture for speech input, integrating joint fine-tuning of a speech encoder and LLM with document-level post-editing. Contributions include: (i) the first systematic incorporation of document-level contextual modeling into offline ST, and (ii) the first speech-native LLM framework supporting diverse instruction execution. Experiments demonstrate state-of-the-art performance on both IWSLT 2025 tracks—achieving significant BLEU improvements and markedly higher instruction-following accuracy—validating the efficacy of LLM-driven speech semantic understanding and controllable generation.

📝 Abstract
The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly with the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process, incorporating an additional refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage to further enhance output quality by using contextual information.
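The Offline ST pipeline in the abstract can be sketched as the following control flow: per segment, fuse the hypotheses of several ASR systems with an LLM conditioned on document-level context, then translate in two steps (draft, then refine). This is a minimal illustrative sketch, not the paper's implementation; `call_llm` is a deterministic stub standing in for any instruction-tuned LLM endpoint, and all function names and prompts are assumptions.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real system would query an LLM here. For a runnable
    # sketch, it simply echoes the last line of the prompt.
    return prompt.splitlines()[-1]

def fuse_asr_outputs(hypotheses: list[str], doc_context: list[str]) -> str:
    """Fuse transcripts from several ASR systems, conditioning the LLM
    on document-level context (the segments fused so far)."""
    prompt = (
        "Document context:\n" + "\n".join(doc_context) + "\n"
        "Merge the best transcript from these hypotheses:\n"
        + "\n".join(hypotheses)
    )
    return call_llm(prompt)

def translate_two_step(transcript: str, tgt_lang: str) -> str:
    # Step 1: draft translation. Step 2: LLM-based refinement.
    draft = call_llm(f"Translate to {tgt_lang}:\n{transcript}")
    refined = call_llm(f"Refine this {tgt_lang} translation:\n{draft}")
    return refined

def offline_st(segments: list[list[str]], tgt_lang: str = "de") -> list[str]:
    """segments[i] holds the hypotheses of multiple ASR systems for segment i."""
    doc_context: list[str] = []
    translations = []
    for hypotheses in segments:
        fused = fuse_asr_outputs(hypotheses, doc_context)
        doc_context.append(fused)  # grow the document-level context
        translations.append(translate_two_step(fused, tgt_lang))
    return translations
```

With the echo stub, the pipeline simply propagates the last hypothesis of each segment; in a real system each `call_llm` step would alter its input.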
Problem

Research questions and friction points this paper is trying to address.

Enhancing offline speech translation using LLMs and document-level context
Developing end-to-end instruction-following models with speech-LLM integration
Improving translation and task output quality via multi-step refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverage LLMs for offline speech translation enhancement
Fuse ASR outputs using LLM with document-level context
Integrate speech encoder with LLM for instruction following