🤖 AI Summary
Existing end-to-end speech generation models suffer from poor coherence, training instability, and excessive memory consumption when modeling long-duration speech (over one minute), limiting their applicability in scenarios such as long-video dubbing and voice assistants. To address this, we propose SpeechSSM, the first end-to-end framework for long-form speech generation based on State Space Models (SSMs), capable of directly modeling speech token sequences of up to 16 minutes without intermediate text representations. Our method combines high-fidelity acoustic tokenization, long-context-optimized training, and linear-complexity sequence modeling. Key contributions include: (1) the first comprehensive evaluation suite for long-form speech generation, incorporating embedding-based metrics, LLM-based assessment, and duration-aware analysis, along with the new benchmark LibriSpeech-Long; (2) substantial improvements in speech naturalness and cross-minute coherence on this benchmark; and (3) public release of speech samples and the LibriSpeech-Long dataset.
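The "linear-complexity sequence modeling" the summary refers to can be illustrated with a minimal linear state-space recurrence. This is a hand-written toy sketch, not the paper's implementation: the scalar parameters `a`, `b`, `c` and the function `ssm_scan` are invented for illustration. The point is that each step costs O(1) time and the state is a fixed-size summary of the history, unlike attention's cache that grows with sequence length.

```python
def ssm_scan(x, a=0.9, b=0.5, c=1.0):
    """Run a 1-D linear state-space recurrence over input sequence x.

    State update:  h_t = a * h_{t-1} + b * x_t   (O(1) per step)
    Readout:       y_t = c * h_t

    Time is O(T) in sequence length; memory for the state is O(1),
    which is why SSM-style models scale to very long sequences.
    """
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t   # recurrent state update
        ys.append(c * h)      # output at step t
    return ys

# Toy 4-step input: a single impulse, whose influence decays through the state.
outputs = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

In practice such recurrences use vector-valued states and are evaluated with parallel scans, but the constant-memory, linear-time character shown here is what makes minutes-long audio tractable in a single decoding session.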
📝 Abstract
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from the high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and LibriSpeech-Long, a new benchmark for long-form speech processing and generation. Speech samples and the dataset are released at https://google.github.io/tacotron/publications/speechssm/