🤖 AI Summary
Existing speech-language models discretize speech into phonemes or semantic tokens before feeding them into large language models (LLMs), which severely degrades prosodic information such as intonation, stress, and emotion, and consequently limits pre-trained models' ability to understand and generate prosody. Method: We propose a word-level prosody-text joint modeling framework with a novel tokenization scheme that preserves the full prosodic structure: speech is converted into hybrid sequences of "text + word-level prosody tokens" and fed directly into standard LLMs for end-to-end pre-training. Contribution/Results: Our approach is the first to elicit emergent prosodic capabilities, including contrastive stress recognition, fine-grained emotion discrimination, and long-text prosodic consistency modeling, through pre-training alone. Experiments demonstrate significant improvements over conventional discrete-token methods on both prosody perception and generation tasks, establishing a new paradigm for prosodic intelligence in speech-language models.
📝 Abstract
Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal for learning prosody information -- we find that the resulting LLMs do not exhibit obvious emergent prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed scheme retains more complete prosody information and is more understandable to text-based LLMs. We find that ProsodyLM learns surprisingly diverse emergent prosody processing capabilities through pre-training alone, ranging from harnessing prosody nuances in generated speech (such as contrastive focus) and understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
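To make the "text + word-level prosody tokens" idea concrete, here is a minimal illustrative sketch, not the paper's actual implementation: each transcribed word is followed by one discrete prosody token obtained by binning per-word acoustic features. The function names, the choice of features (mean pitch, energy, duration), the bin counts, and the token format are all assumptions made for demonstration.

```python
# Hypothetical sketch of a hybrid "text + word-level prosody token" sequence.
# Feature choices, ranges, and bin counts are illustrative assumptions only.

def quantize(value, lo, hi, n_bins):
    """Map a scalar feature into one of n_bins integer bins, clipping to [lo, hi]."""
    value = min(max(value, lo), hi)
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)

def to_hybrid_sequence(words, prosody):
    """Interleave each transcribed word with one discrete prosody token.

    words   : list of transcribed words
    prosody : list of (mean_pitch_hz, energy, duration_s) tuples, one per word
    """
    tokens = []
    for word, (pitch, energy, dur) in zip(words, prosody):
        tokens.append(word)
        # One composite prosody token per word, e.g. "<P6_E3_D1>"
        p = quantize(pitch, 60.0, 400.0, 8)
        e = quantize(energy, 0.0, 1.0, 4)
        d = quantize(dur, 0.05, 1.0, 4)
        tokens.append(f"<P{p}_E{e}_D{d}>")
    return tokens

# Example: a contrastively stressed "I" carries higher pitch and energy
# than the surrounding words, which the prosody tokens make visible.
seq = to_hybrid_sequence(
    ["I", "never", "said", "that"],
    [(320.0, 0.9, 0.30), (180.0, 0.5, 0.25),
     (170.0, 0.4, 0.20), (160.0, 0.4, 0.22)],
)
print(seq)
```

Because the output is an ordinary sequence of text-like tokens, it can be fed to a standard LLM tokenizer without architectural changes, which is the key property the abstract attributes to this scheme.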