Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

📅 2025-09-22

🤖 AI Summary

This work addresses the monotonic alignment of parallel speech documents without text transcriptions. It proposes Speech Vecalign, a method that monotonically aligns speech segment embeddings and does not depend on transcriptions. Compared with two speech mining baselines, it produces longer speech-to-speech alignments than Global Mining and less noisy alignments than Local Mining. Applied to 3,000 hours of unlabeled parallel English-German VoxPopuli speech documents, it yields about 1,000 hours of high-quality alignments. Speech-to-speech translation models trained on this data improve over Global Mining by 0.37 ASR-BLEU (En-to-De) and 0.18 ASR-BLEU (De-to-En), and they match or outperform SpeechMatrix models despite using 8 times fewer raw speech documents.

📝 Abstract

We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
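To make the idea of monotonically aligning segment embeddings concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes each document is already a list of fixed-size segment embeddings, scores pairs by cosine similarity, and uses a plain dynamic program that allows one-to-one matches or skips. The function names and the `skip_penalty` value are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def monotonic_align(src, tgt, skip_penalty=0.3):
    """Return index pairs (i, j) in increasing order that maximize total similarity.

    Hypothetical illustration: only 1-1 matches and skips are modeled, and
    skip_penalty is an assumed constant, not a value from the paper.
    """
    n, m = len(src), len(tgt)
    # score[i][j]: best total score for aligning src[:i] with tgt[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - skip_penalty
        back[i][0] = "skip_src"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - skip_penalty
        back[0][j] = "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + cosine(src[i - 1], tgt[j - 1]), "match"),
                (score[i - 1][j] - skip_penalty, "skip_src"),
                (score[i][j - 1] - skip_penalty, "skip_tgt"),
            ]
            # max with a key keeps the first candidate on ties, so matches win ties
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Walk the backpointers from the end to recover the matched pairs
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    # Toy embeddings: the second source segment has no counterpart in the target
    src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    tgt = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
    print(monotonic_align(src, tgt))
```

Because every move advances at least one index and never goes backward, the returned pairs are strictly increasing on both sides, which is the monotonicity constraint. A real system would also need to produce the embeddings from audio and handle many-to-many segment groupings, neither of which this sketch attempts.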

Problem

Research questions and friction points this paper is trying to address.

Aligning parallel speech documents without text transcriptions

Improving alignment length and reducing noise in speech mining

Enhancing speech-to-speech translation performance with fewer resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns speech embeddings without text transcriptions

Produces longer alignments than Global Mining and less noisy alignments than Local Mining

Uses an embedding-based method for parallel speech document alignment
