Long-Context Speech Synthesis with Context-Aware Memory

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of paragraph-level contextual coherence and the inconsistent prosody and timbre in long-context text-to-speech (TTS), this paper proposes an end-to-end synthesis framework based on Context-Aware Memory (CAM). The framework dynamically maintains both long-term memory and local context to enable cross-sentence information propagation, and introduces a prefix masking mechanism to support bidirectional contextual modeling under unidirectional autoregressive generation constraints. Its core innovation lies in the synergistic design of the CAM module and a dynamic memory update mechanism, effectively balancing long-range dependency modeling with inference efficiency. Experimental results demonstrate that the proposed method significantly outperforms existing long-context TTS approaches in prosodic expressiveness, paragraph-level coherence, and naturalness of the synthesized speech, while maintaining real-time inference speed.

📝 Abstract
In long-text speech synthesis, current approaches typically convert text to speech at the sentence level and concatenate the results to form pseudo-paragraph-level speech. These methods overlook the contextual coherence of paragraphs, leading to reduced naturalness and inconsistencies in style and timbre across the long-form speech. To address these issues, we propose a Context-Aware Memory (CAM)-based long-context Text-to-Speech (TTS) model. The CAM block integrates and retrieves both long-term memory and local context details, enabling dynamic memory updates and transfers within long paragraphs to guide sentence-level speech synthesis. Furthermore, the prefix mask enhances the in-context learning ability by enabling bidirectional attention on prefix tokens while maintaining unidirectional generation. Experimental results demonstrate that the proposed method outperforms baseline and state-of-the-art long-context methods in terms of prosodic expressiveness, coherence, and context inference cost across paragraph-level speech.
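The overall control flow implied by the abstract — synthesize sentence by sentence while carrying a long-term memory and a short window of local context across the paragraph — can be sketched as below. This is a toy illustration, not the paper's actual architecture: `embed`, `synth`, the running-average long-term update, and the fixed local window are all placeholder assumptions.

```python
def synthesize_paragraph(sentences, embed, synth, window=2):
    """Hypothetical driver loop: synthesize each sentence conditioned on a
    long-term memory summary and a short window of local context embeddings.
    embed(sentence) -> float and synth(sentence, long_term, local) -> audio
    are stand-in interfaces, not the paper's API."""
    long_term = 0.0   # toy long-term memory (running average of embeddings)
    local = []        # most recent sentence embeddings (local context)
    outputs = []
    for i, sent in enumerate(sentences, start=1):
        # Condition the current sentence on memory accumulated so far
        audio = synth(sent, long_term, list(local))
        # Dynamic memory update after each sentence
        e = embed(sent)
        long_term += (e - long_term) / i      # fold into long-term summary
        local.append(e)
        if len(local) > window:
            local.pop(0)                      # keep only recent context
        outputs.append(audio)
    return outputs
```

The key property the sketch captures is that memory is updated and transferred between sentences, so later sentences are conditioned on paragraph history rather than synthesized in isolation.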
Problem

Research questions and friction points this paper is trying to address.

Addresses paragraph-level coherence loss in long-text speech synthesis
Resolves style and timbre inconsistencies across extended speech segments
Reduces context inference cost while improving prosodic expressiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-Aware Memory for dynamic updates
Prefix mask enables bidirectional attention
Long-context TTS with memory transfer
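The prefix mask described above is the standard prefix-LM attention pattern: every position may attend to all prefix (context) tokens, while generated tokens remain causal among themselves. A minimal NumPy sketch, assuming a boolean mask where `True` means "may attend" (the paper's exact masking convention is not specified here):

```python
import numpy as np

def prefix_causal_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """Boolean attention mask of shape (total_len, total_len).
    mask[i, j] is True if query position i may attend to key position j:
    bidirectional over the first `prefix_len` tokens, causal elsewhere."""
    # Start from a causal (lower-triangular) mask ...
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))
    # ... then open up full attention to every prefix position
    mask[:, :prefix_len] = True
    return mask
```

For example, with `prefix_len=2` and `total_len=4`, position 0 can attend to position 1 (bidirectional within the prefix), while position 2 still cannot attend to the future position 3.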