DTW-Align: Bridging the Modality Gap in End-to-End Speech Translation with Dynamic Time Warping Alignment

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
End-to-end speech translation (E2E-ST) faces a fundamental challenge of cross-modal representation misalignment between speech and text, particularly in low-resource settings where word-level alignment annotations are unavailable. This hinders precise semantic alignment across modalities. To address this, we propose the first integration of dynamic time warping (DTW) into embedding-layer alignment training for E2E-ST—requiring no external alignment tools or forced-alignment modules—and directly optimize sequence-level semantic consistency between speech and text embeddings. Our method is trained jointly with the end-to-end neural architecture and outperforms prior approaches in five of six low-resource language directions, while also improving training efficiency. The core contribution is an unsupervised, differentiable DTW-based embedding alignment mechanism that effectively bridges the modality gap, establishing a novel paradigm for low-resource E2E-ST.

📝 Abstract
End-to-End Speech Translation (E2E-ST) is the task of translating source speech directly into target text, bypassing the intermediate transcription step. The representation discrepancy between the speech and text modalities has motivated research on what is known as bridging the modality gap. State-of-the-art methods address this by aligning speech and text representations at the word or token level. Unfortunately, this requires an alignment tool that is not available for all languages. Although this issue has been addressed by aligning speech and text embeddings using nearest-neighbor similarity search, that approach does not lead to accurate alignments. In this work, we adapt Dynamic Time Warping (DTW) to align speech and text embeddings during training. Our experiments demonstrate the effectiveness of our method in bridging the modality gap in E2E-ST. Compared to previous work, our method produces more accurate alignments and achieves comparable E2E-ST results while being significantly faster. Furthermore, our method outperforms previous work in low-resource settings on 5 out of 6 language directions.
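The abstract does not spell out the alignment objective, but training-time DTW over embeddings is commonly made differentiable via the soft-DTW relaxation (Cuturi & Blondel, 2017). The sketch below is a hypothetical illustration of such a loss between a speech embedding sequence and a (shorter) text embedding sequence; the function name, cost choice (squared Euclidean), and smoothing parameter `gamma` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def soft_dtw(speech_emb, text_emb, gamma=1.0):
    """Soft-DTW alignment cost between two embedding sequences.

    A differentiable relaxation of DTW: the hard min over warping-path
    predecessors is replaced by a smoothed (log-sum-exp) minimum, so the
    cost can serve as a training loss. Hypothetical stand-in for the
    paper's embedding-alignment objective.
    """
    n, m = len(speech_emb), len(text_emb)
    # Pairwise squared-Euclidean cost between every speech/text frame pair.
    D = ((speech_emb[:, None, :] - text_emb[None, :, :]) ** 2).sum(-1)
    # R[i, j] = accumulated soft-DTW cost of aligning prefixes of length i, j.
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Smoothed minimum over the three DTW predecessor cells.
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            softmin = -gamma * np.log(np.sum(np.exp(-prev / gamma)))
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

As `gamma` approaches zero the smoothed minimum recovers classical DTW; in practice one would compute this on GPU with gradients (e.g. in PyTorch) so the loss can be backpropagated into both encoders.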
Problem

Research questions and friction points this paper is trying to address.

Addressing modality gap between speech and text in end-to-end translation
Improving alignment accuracy without requiring external alignment tools
Enhancing translation performance in low-resource language settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Dynamic Time Warping for modality alignment
Aligns speech and text embeddings during training
Achieves faster, more accurate alignments without external tools
Abderrahmane Issam
Department of Advanced Computing Sciences, Maastricht University
Yusuf Can Semerci
Assistant Professor, Maastricht University
Human Activity Recognition · Affective Computing · Educational Computing
Jan Scholtes
Department of Advanced Computing Sciences, Maastricht University
Gerasimos Spanakis
Assistant Professor, Maastricht University