🤖 AI Summary
This work addresses unsupervised word segmentation and vocabulary induction from untranscribed speech. We propose a lightweight two-stage approach: first, leveraging self-supervised speech representations (wav2vec 2.0), we compute dissimilarities between adjacent frames and estimate word-boundary confidence via sliding-window aggregation; second, we apply K-means clustering to the segmented units to induce a vocabulary. Departing from conventional dynamic-programming optimization frameworks, our method adopts an interpretable, low-overhead "boundary detection + clustering" paradigm. Evaluated on the five-language ZeroSpeech zero-resource benchmark, our approach achieves segmentation and lexicon quality competitive with the state-of-the-art ES-KMeans+ method, while accelerating inference by roughly 5× and substantially reducing computational cost, demonstrating both high accuracy and high efficiency.
📝 Abstract
We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach matches the state-of-the-art results of the updated ES-KMeans+ method, while being almost five times faster. Project webpage: https://s-malan.github.io/prom-seg-clus.
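The "boundary detection + clustering" pipeline can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine dissimilarity, the simple peak-picking rule, the mean-pooling of segments, and the toy k-means loop are all illustrative assumptions standing in for the paper's actual features, aggregation, and clustering setup.

```python
import numpy as np

def detect_boundaries(feats, threshold=0.5):
    """Mark a boundary wherever adjacent frames are dissimilar.

    feats: (T, D) array of per-frame self-supervised features.
    Returns segment boundaries as frame indices, including 0 and T.
    """
    a, b = feats[:-1], feats[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    dissim = 1.0 - cos  # high value = adjacent frames differ a lot
    bounds = [0]
    for t in range(1, len(dissim) - 1):
        # keep local peaks of the dissimilarity curve above the threshold
        if dissim[t] > threshold and dissim[t] >= dissim[t - 1] and dissim[t] >= dissim[t + 1]:
            bounds.append(t + 1)
    bounds.append(len(feats))
    return bounds

def pool_segments(feats, bounds):
    """Mean-pool the frames of each predicted segment into one vector."""
    return np.stack([feats[s:e].mean(axis=0) for s, e in zip(bounds[:-1], bounds[1:])])

def kmeans(X, k, iters=50, seed=0):
    """Tiny k-means: cluster pooled segment vectors into a k-word lexicon."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

On a toy sequence of two constant "words" (ten frames of one vector followed by ten of another), the dissimilarity curve spikes once at the transition, giving one internal boundary and two segments that k-means assigns to different clusters. In contrast to ES-KMeans-style methods, no dynamic program over candidate segmentations is needed: boundaries are fixed first, then clustering runs once over the pooled segments.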