DTW-Align: Bridging the Modality Gap in End-to-End Speech Translation with Dynamic Time Warping Alignment

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
End-to-end speech translation (E2E-ST) faces a fundamental challenge of cross-modal representation misalignment between speech and text, particularly in low-resource settings where word-level alignment annotations are unavailable. This hinders precise semantic alignment across modalities. To address this, we propose the first integration of dynamic time warping (DTW) into embedding-layer alignment training for E2E-ST—requiring no external alignment tools or forced-alignment modules—and directly optimize sequence-level semantic consistency between speech and text embeddings. Our method is trained jointly with the end-to-end neural architecture and outperforms prior approaches in five of six low-resource language directions, while also improving training efficiency. The core contribution is an unsupervised, differentiable DTW-based embedding alignment mechanism that effectively bridges the modality gap, establishing a novel paradigm for low-resource E2E-ST.

📝 Abstract
End-to-End Speech Translation (E2E-ST) is the task of translating source speech directly into target text, bypassing the intermediate transcription step. The representation discrepancy between the speech and text modalities has motivated research on what is known as bridging the modality gap. State-of-the-art methods address this by aligning speech and text representations at the word or token level. Unfortunately, this requires an alignment tool that is not available for all languages. Although this issue has been addressed by aligning speech and text embeddings using nearest-neighbor similarity search, that approach does not lead to accurate alignments. In this work, we adapt Dynamic Time Warping (DTW) to align speech and text embeddings during training. Our experiments demonstrate the effectiveness of our method in bridging the modality gap in E2E-ST. Compared to previous work, our method produces more accurate alignments and achieves comparable E2E-ST results while being significantly faster. Furthermore, our method outperforms previous work in low-resource settings on 5 out of 6 language directions.
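The abstract does not spell out the alignment objective, but training-time DTW over embeddings is commonly made differentiable via the soft-DTW relaxation (Cuturi & Blondel, 2017). The sketch below is a hypothetical illustration of such a loss between a speech embedding sequence and a (shorter) text embedding sequence; the function name, cost choice (squared Euclidean), and smoothing parameter `gamma` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def soft_dtw(speech_emb, text_emb, gamma=1.0):
    """Soft-DTW alignment cost between two embedding sequences.

    A differentiable relaxation of DTW: the hard min over warping-path
    predecessors is replaced by a smoothed (log-sum-exp) minimum, so the
    cost can serve as a training loss. Hypothetical stand-in for the
    paper's embedding-alignment objective.
    """
    n, m = len(speech_emb), len(text_emb)
    # Pairwise squared-Euclidean cost between every speech/text frame pair.
    D = ((speech_emb[:, None, :] - text_emb[None, :, :]) ** 2).sum(-1)
    # R[i, j] = accumulated soft-DTW cost of aligning prefixes of length i, j.
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Smoothed minimum over the three DTW predecessor cells.
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            softmin = -gamma * np.log(np.sum(np.exp(-prev / gamma)))
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

As `gamma` approaches zero the smoothed minimum recovers classical DTW; in practice one would compute this on GPU with gradients (e.g. in PyTorch) so the loss can be backpropagated into both encoders.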
Problem

Research questions and friction points this paper is trying to address.

Addressing modality gap between speech and text in end-to-end translation
Improving alignment accuracy without requiring external alignment tools
Enhancing translation performance in low-resource language settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Dynamic Time Warping for modality alignment
Aligns speech and text embeddings during training
Achieves faster, more accurate alignments without external tools
Abderrahmane Issam
Department of Advanced Computing Sciences, Maastricht University
Yusuf Can Semerci
Assistant Professor, Maastricht University
Human Activity Recognition · Affective Computing · Educational Computing
Jan Scholtes
Department of Advanced Computing Sciences, Maastricht University
Gerasimos Spanakis
Assistant Professor, Maastricht University