Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents

📅 2025-09-22
🤖 AI Summary
This work addresses the monotonic alignment of parallel speech documents without text transcriptions. It proposes Speech Vecalign, a method that monotonically aligns speech segment embeddings and is therefore text-free. Compared with two speech mining variants, it produces longer speech-to-speech alignments than Global Mining and less noise than Local Mining. Applied to 3,000 hours of unlabeled parallel English-German VoxPopuli speech documents, it yields about 1,000 hours of high-quality alignments. Speech-to-speech translation models trained on this data improve over Global Mining by 0.37 ASR-BLEU (En-to-De) and 0.18 ASR-BLEU (De-to-En), and they match or outperform SpeechMatrix models despite using 8 times fewer raw speech documents.

📝 Abstract
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.

Problem

Research questions and friction points this paper is trying to address.

Aligning parallel speech documents without text transcriptions
Improving alignment length and reducing noise in speech mining
Enhancing speech-to-speech translation performance with fewer resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns speech embeddings without text transcriptions
Produces longer and less noisy alignments
Uses an embedding-based method for parallel speech document alignment
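
To make "monotonically aligns speech segment embeddings" concrete, here is a minimal sketch of one way such an alignment could be computed: dynamic programming over cosine similarities between segment embeddings. This page does not describe the paper's actual procedure, so this is an illustration only, not the authors' algorithm. The function names, the one-to-one match restriction, and the `skip_cost` value are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def monotonic_align(
    src: Sequence[Vector],
    tgt: Sequence[Vector],
    skip_cost: float = 0.3,  # hypothetical penalty for leaving a segment unaligned
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), strictly increasing in both i and j."""
    n, m = len(src), len(tgt)
    # score[i][j]: best total score for aligning src[:i] with tgt[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]

    # Borders: the only option is to skip every segment seen so far.
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - skip_cost
        back[i][0] = "skip_src"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - skip_cost
        back[0][j] = "skip_tgt"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + cosine(src[i - 1], tgt[j - 1]), "match"),
                (score[i - 1][j] - skip_cost, "skip_src"),
                (score[i][j - 1] - skip_cost, "skip_tgt"),
            ]
            # On ties, max() keeps the first candidate, so "match" is preferred.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Walk the back-pointers from the end to recover the matched pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Toy embeddings: the middle source segment has no counterpart in the target.
src_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_emb = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(monotonic_align(src_emb, tgt_emb))  # [(0, 0), (2, 1)]
```

Because every move advances at least one index and never goes backward, the recovered pairs cannot cross, which is what makes the alignment monotonic. A real system would take embeddings from a speech encoder and would likely need many-to-many matches and a faster search than this quadratic table.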