🤖 AI Summary
Existing multilingual speech-to-speech translation (S2ST) systems often either ignore source-language information or encode it as flat, independent language identifiers. This overlooks linguistic structure shared across languages and may limit adaptation when supervised data are scarce. The authors propose S2ST-Omni 2, a many-to-one compositional framework that replaces flat language labels with structured linguistic typology priors. It combines typology-informed hierarchical language encoding, a dynamically gated language-aware Dual-CTC, and typology-aware prompts for the large language model, so that language conditioning acts at the representation, acoustic, and decoding levels. On CVSS-C, the method achieves the best average performance among representative S2ST approaches on BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Controlled data-budget analyses and a Japanese-to-English evaluation with only about three hours of supervised training data suggest that typological priors improve data efficiency.
📝 Abstract
Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.
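To make the contrast between flat language labels and structured typological priors concrete, here is a minimal toy sketch. It is not the authors' implementation: the feature set (`family`, `genus`, `word_order`), the small `TYPOLOGY` table, the averaging scheme, and the helper names are all assumptions chosen for illustration, and hashed vectors stand in for learned embeddings.

```python
import hashlib

DIM = 8  # toy embedding size (assumption)

# Hypothetical typology table for illustration only; the paper's actual
# feature inventory is not given in the abstract.
TYPOLOGY = {
    "fr": {"family": "Indo-European", "genus": "Romance", "word_order": "SVO"},
    "es": {"family": "Indo-European", "genus": "Romance", "word_order": "SVO"},
    "de": {"family": "Indo-European", "genus": "Germanic", "word_order": "V2"},
    "ja": {"family": "Japonic", "genus": "Japanese", "word_order": "SOV"},
}


def vec(key):
    """Deterministic stand-in for one row of a learned embedding table."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def flat_encoding(lang):
    """Language-as-label: one independent vector per language."""
    return vec("lang:" + lang)


def hierarchical_encoding(lang):
    """Average the family, genus, word-order, and language-specific vectors.

    Languages that share a typological feature reuse the same component,
    so part of the representation is shared across related languages.
    """
    feats = TYPOLOGY[lang]
    parts = [
        vec("family:" + feats["family"]),
        vec("genus:" + feats["genus"]),
        vec("word_order:" + feats["word_order"]),
        vec("lang:" + lang),
    ]
    return [sum(column) / len(parts) for column in zip(*parts)]


def shared_features(lang_a, lang_b):
    """Names of the typological features on which two languages agree."""
    a, b = TYPOLOGY[lang_a], TYPOLOGY[lang_b]
    return sorted(name for name in a if a[name] == b[name])
```

In this toy table, `shared_features("fr", "es")` lists all three features while `shared_features("fr", "ja")` is empty, so the hierarchical vectors for French and Spanish differ only in their language-specific component. A flat encoding gives no such sharing, which is the intuition behind using typological priors when supervised data for a language are scarce.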