🤖 AI Summary
This work addresses monotonic alignment of parallel speech documents without text transcriptions. It proposes Speech Vecalign, a method that monotonically aligns speech segment embeddings, and compares it against two speech mining variants, Global Mining and Local Mining. Speech Vecalign produces longer speech-to-speech alignments than Global Mining and less noise than Local Mining. Applied to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, it yields about 1,000 hours of high-quality alignments. En-De speech-to-speech translation models trained on the aligned data improve over Global Mining by 0.37 ASR-BLEU (En-to-De) and 0.18 ASR-BLEU (De-to-En), and match or outperform SpeechMatrix models despite using 8 times fewer raw speech documents. The main contribution is a text-free monotonic alignment method for constructing parallel speech corpora for speech-to-speech translation.
📝 Abstract
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
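To make "monotonically aligns speech segment embeddings" concrete, the following is a minimal sketch, not the paper's implementation. It assumes each segment has already been mapped to a fixed-size embedding, uses cosine distance as the match cost, allows only 1-to-1 matches and skips, and applies a hypothetical `skip_cost` penalty chosen purely for illustration. The function names `cosine` and `monotonic_align` are likewise invented here. The abstract does not specify how Speech Vecalign scores or searches alignments, so this should be read only as a generic dynamic-programming illustration of monotonic alignment.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def monotonic_align(src, tgt, skip_cost=0.3):
    """Return index pairs (i, j) of a minimum-cost monotonic 1-to-1 alignment.

    src, tgt: lists of embedding vectors, in document order.
    skip_cost: illustrative penalty for leaving a segment unaligned (an assumption).
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i source and first j target segments
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Match source segment i-1 with target segment j-1.
                c = cost[i - 1][j - 1] + (1.0 - cosine(src[i - 1], tgt[j - 1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "match"
            if i > 0:
                # Leave source segment i-1 unaligned.
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_src"
            if j > 0:
                # Leave target segment j-1 unaligned.
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_tgt"

    # Walk the back-pointers from the end to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

Because every step moves forward in at least one document and never backward, the returned pairs are strictly increasing in both indices, which is the monotonicity constraint. A source segment with no counterpart in the target is skipped rather than forced into a poor match.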