Long-Form Speech Generation with Spoken Language Models

📅 2024-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end speech generation models suffer from loss of coherence, training instability, and excessive memory consumption when modeling long-duration speech (over one minute), limiting their applicability in scenarios such as long-video dubbing and voice assistants. To address this, the authors propose SpeechSSM, the first spoken language model to learn from and generate long-form audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, building on recent advances in linear-time sequence modeling with State Space Models (SSMs). Key contributions include: (1) a comprehensive evaluation suite for long-form speech generation, incorporating embedding-based metrics, LLM-judged assessment, and quality measurements over length and time, along with the new benchmark LibriSpeech-Long; (2) substantial improvements in speech naturalness and coherence across multiple minutes on this benchmark; and (3) public release of speech samples and the LibriSpeech-Long dataset.

📝 Abstract
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/
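The abstract attributes SpeechSSM's long-form capability to linear-time sequence modeling: unlike attention, whose cost and memory grow with context length, a state-space recurrence carries only a fixed-size hidden state across steps. The paper's exact architecture is not reproduced here; the following is a minimal generic sketch of a diagonal linear state-space recurrence (all names and parameter values are illustrative, not from the paper), showing why per-step memory stays constant regardless of sequence length.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time state-space recurrence over a 1-D input sequence:
        h_t = A * h_{t-1} + B * x_t   (diagonal A: elementwise update)
        y_t = C . h_t
    Memory is O(state_dim), independent of sequence length T."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A * h + B * x_t   # only the fixed-size state h is carried forward
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
T, d = 10_000, 16                  # long sequence, small hidden state
A = rng.uniform(0.9, 0.99, size=d) # stable diagonal transition (|A| < 1)
B = rng.normal(size=d)
C = rng.normal(size=d)
x = rng.normal(size=T)             # stand-in for a long token stream
y = ssm_scan(x, A, B, C)
print(y.shape)                     # (10000,)
```

In practice such recurrences are computed with parallel scans or convolutional forms for training efficiency; this sequential loop is only meant to make the constant-memory, linear-time property concrete.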
Problem

Research questions and friction points this paper is trying to address.

Long-form Speech Generation
Quality Degradation
Memory Consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

SpeechSSM
Text-independent Speech Synthesis
LibriSpeech-Long Benchmark
Se Jin Park
Korea Advanced Institute of Science and Technology (KAIST)
multimodal learning, image/video generation, speech processing
Julián Salazar
Google DeepMind
A. Jansen
Google DeepMind
Keisuke Kinoshita
Research Scientist at Google
Y. Ro
Integrated Vision and Language Lab, KAIST
R. Skerry-Ryan
Google DeepMind