Unsupervised Speech Segmentation: A General Approach Using Speech Language Models

📅 2025-01-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly modeling multiple non-lexical acoustic-semantic styles (e.g., speaker identity, emotion) in unsupervised speech segmentation. We propose a general unsupervised segmentation framework based on Speech Language Models (SLMs), the first to leverage SLMs for this task. Our method extracts robust acoustic-semantic joint representations via an SLM and integrates them with unsupervised boundary detection and representation clustering to achieve fine-grained, style-aware segmentation—without requiring any text transcription. Unlike conventional approaches that model only a single style, our framework enables multi-style joint modeling. Experiments demonstrate significant improvements over state-of-the-art baselines across key metrics: boundary detection accuracy, segment purity, and over-segmentation rate. These results validate both the effectiveness of our approach and its strong cross-style generalization capability.
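The pipeline described above (SLM-derived joint representations, unsupervised boundary detection, then clustering of segment representations into style groups) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are assumed to come from some SLM encoder, and the boundary threshold, window handling, and the tiny k-means loop are all hypothetical simplifications.

```python
import numpy as np

def segment_by_style(frame_embeddings, threshold=0.35, n_styles=2, seed=0):
    """Toy sketch of SLM-based unsupervised segmentation.

    frame_embeddings: (T, D) array of acoustic-semantic embeddings,
    assumed to be extracted per window by a Speech Language Model.
    Returns (segments, labels): segment (start, end) index pairs and
    a cluster label per segment.
    """
    E = np.asarray(frame_embeddings, dtype=float)
    # 1) Boundary detection: cosine distance between adjacent embeddings;
    #    a large jump suggests a change in acoustic-semantic style.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    dist = 1.0 - np.sum(En[:-1] * En[1:], axis=1)
    cuts = [0] + [i + 1 for i, d in enumerate(dist) if d > threshold] + [len(E)]
    segments = list(zip(cuts[:-1], cuts[1:]))
    # 2) Representation clustering: mean embedding per segment, then a
    #    small k-means loop assigns each segment to one of n_styles groups.
    means = np.stack([E[s:e].mean(axis=0) for s, e in segments])
    rng = np.random.default_rng(seed)
    k = min(n_styles, len(means))
    centers = means[rng.choice(len(means), size=k, replace=False)]
    for _ in range(10):
        labels = np.argmin(((means[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = means[labels == j].mean(axis=0)
    return segments, labels
```

Note that nothing here requires a transcript: boundaries and style labels are derived purely from the embedding geometry, which is what lets a single pipeline cover speaker, emotion, or other non-lexical style changes.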

📝 Abstract
In this paper, we introduce an unsupervised approach to Speech Segmentation that builds on previously studied tasks, e.g., Speaker Diarization, while applying to a broader set of acoustic-semantic distinctions, paving a path toward a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach segments the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks handle only a single style change, e.g., emotion diarization, our approach handles multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method for segmenting a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach across several setups. Results suggest that the proposed method outperforms the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.
Problem

Research questions and friction points this paper is trying to address.

Unsupervised Learning
Speech Segmentation
Non-linguistic Features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised Learning
Speech Segmentation
Advanced Speech Understanding