Borderless Long Speech Synthesis

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech synthesis systems struggle to model multi-speaker interactions, emotional dynamics, and acoustic environment variations in real-world scenarios due to their limited capacity for unified understanding of global context and paralinguistic cues. This work proposes an end-to-end long-form speech synthesis framework that establishes an information-rich, wideband control channel through a Global-Sentence-Token multi-level annotation scheme and a native agent architecture, enabling seamless conversion from multimodal inputs to structured vocal directives. By integrating a continuous tokenizer backbone, chain-of-thought reasoning, and dimension dropout techniques—alongside a top-down hierarchical labeling strategy—the method substantially enhances instruction-following fidelity under complex conditions, achieving boundary-free, coherent long-form speech generation that effectively captures realistic interactions involving multiple speakers, evolving emotions, and dynamic environmental contexts.
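The summary's "Global-Sentence-Token multi-level annotation scheme" can be pictured as a three-level structure: a global layer for scene and acoustic environment, a sentence layer for speaker, emotion, and overlap, and a token layer for word-level cues. The paper does not publish its schema, so the field names and types below are illustrative assumptions, not the authors' actual format:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TokenAnnotation:
    text: str
    cue: str | None = None  # e.g. an emphasis or laughter marker

@dataclass
class SentenceAnnotation:
    speaker: str
    emotion: str
    overlaps_previous: bool = False  # models interruptions/overlapping speech
    tokens: list[TokenAnnotation] = field(default_factory=list)

@dataclass
class GlobalAnnotation:
    scene: str
    acoustic_env: str
    sentences: list[SentenceAnnotation] = field(default_factory=list)

# A two-speaker exchange with an interruption, annotated top-down.
dialogue = GlobalAnnotation(
    scene="heated debate",
    acoustic_env="small echoey room",
    sentences=[
        SentenceAnnotation(
            speaker="A", emotion="frustrated",
            tokens=[TokenAnnotation("Listen,", cue="emphasis"),
                    TokenAnnotation("I never said that.")],
        ),
        SentenceAnnotation(
            speaker="B", emotion="defensive", overlaps_previous=True,
            tokens=[TokenAnnotation("But you implied it!")],
        ),
    ],
)
```

A directive like this is exactly the kind of "structured vocal directive" a front-end agent could emit: every level of the hierarchy becomes a machine-readable control the synthesis engine can act on.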

📝 Abstract
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
Problem

Research questions and friction points this paper is trying to address.

long speech synthesis
multi-speaker interaction
global context
paralinguistic cues
emotional arcs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Borderless Long Speech Synthesis
Global-Sentence-Token Annotation
Chain-of-Thought Reasoning
Native Agentic TTS
Structured Semantic Interface
👥 Authors
Xingchen Song, MiLM Plus, Xiaomi Inc., China
Di Wu, MiLM Plus, Xiaomi Inc., China
Dinghao Zhou, MiLM Plus, Xiaomi Inc., China
Pengyu Cheng, Alibaba Group
Hongwu Ding, MiLM Plus, Xiaomi Inc., China
Yunchao He, Xiaomi Inc.
Jie Wang, MiLM Plus, Xiaomi Inc., China
Shengfan Shen, MiLM Plus, Xiaomi Inc., China
Sixiang Lv, MiLM Plus, Xiaomi Inc., China
Lichun Fan, MiLM Plus, Xiaomi Inc., China
Hang Su, MiLM Plus, Xiaomi Inc., China
Yifeng Wang, MiLM Plus, Xiaomi Inc., China
Shuai Wang, Nanjing University
Meng Meng, Associate Professor, University of Bath
Jian Luan, Toshiba, Microsoft, Xiaomi