Direct Simultaneous Translation Activation for Large Audio-Language Models

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of simultaneous speech-to-text translation (Simul-S2TT) capability in large audio-language models (LALMs) without modifying model architecture or decoding strategy. We propose SimulSA, a lightweight self-augmentation strategy that leverages the model's own capabilities to generate partially aligned synthetic simultaneous data: audio segments are randomly truncated and paired with the corresponding partial translations. Mixing these synthetic samples, amounting to only about 1% of the full offline SFT data, into offline supervised fine-tuning (SFT) bridges the distribution gap between offline translation seen in pretraining and simultaneous translation at inference. Experiments demonstrate consistent improvements in latency, translation quality, and output stability, without adding parameters or altering inference logic, offering a route to real-time cross-modal translation in LALMs with zero architectural modification.


📝 Abstract
Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time, outputting translations while receiving source speech input, rather than waiting for the entire utterance to be spoken. Simul-S2TT research often modifies model architectures to implement read-write strategies. However, with the rise of large audio-language models (LALMs), a key challenge is how to directly activate Simul-S2TT capabilities in base models without additional architectural changes. In this paper, we introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data by randomly truncating speech and constructing partially aligned translations. By incorporating them into offline SFT data, SimulSA effectively bridges the distribution gap between offline translation during pretraining and simultaneous translation during inference. Experimental results demonstrate that augmenting only about 1% of the simultaneous data, compared to the full offline SFT data, can significantly activate LALMs' Simul-S2TT capabilities without modifications to model architecture or decoding strategy.
Problem

Research questions and friction points this paper is trying to address.

Activating simultaneous translation in large audio models
Bridging offline pretraining and real-time inference gap
Enabling Simul-S2TT without architectural modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simultaneous Self-Augmentation strategy
Truncates speech for partial alignment
Augments minimal data without architectural changes
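The augmentation the bullets describe can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's released code: it assumes a hypothetical sample schema with an `audio` frame sequence and a full `target` translation, and it approximates the partial alignment proportionally by truncation fraction (the paper derives the aligned translation prefix from the model's own outputs).

```python
import random

def simulsa_augment(sample, truncate_frac_range=(0.3, 0.9)):
    """Build one partially aligned Simul-S2TT sample from an offline pair.

    `sample` is a hypothetical dict with `audio` (a sequence of frames)
    and `target` (the full reference translation); the paper does not
    publish a schema, so these field names are assumptions.
    """
    # Randomly truncate the source audio, as in the SimulSA strategy.
    frac = random.uniform(*truncate_frac_range)
    cut = int(len(sample["audio"]) * frac)
    audio_prefix = sample["audio"][:cut]

    # Keep only the translation prefix that plausibly corresponds to the
    # heard audio; here approximated proportionally for illustration.
    words = sample["target"].split()
    target_prefix = " ".join(words[: max(1, int(len(words) * frac))])

    return {"audio": audio_prefix, "target": target_prefix}
```

Samples produced this way are simply mixed into the offline SFT corpus at a small ratio (about 1% per the abstract), so training and inference logic stay unchanged.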
Pei Zhang
Tongyi Lab, Alibaba Group
Yiming Wang
School of Computer Science, Shanghai Jiao Tong University
Jialong Tang
Qwen Team, Alibaba
Baosong Yang
Alibaba Inc.
Rui Wang
School of Computer Science, Shanghai Jiao Tong University
Derek F. Wong
Professor, Department of Computer and Information Science, University of Macau
Fei Huang
Tongyi Lab, Alibaba Group