Borderless Long Speech Synthesis

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech synthesis systems struggle to model multi-speaker interactions, emotional dynamics, and acoustic environment variations in real-world scenarios due to their limited capacity for unified understanding of global context and paralinguistic cues. This work proposes an end-to-end long-form speech synthesis framework that establishes an information-rich, wideband control channel through a Global-Sentence-Token multi-level annotation scheme and a native agent architecture, enabling seamless conversion from multimodal inputs to structured vocal directives. By integrating a continuous tokenizer backbone, chain-of-thought reasoning, and dimension dropout techniques—alongside a top-down hierarchical labeling strategy—the method substantially enhances instruction-following fidelity under complex conditions, achieving boundary-free, coherent long-form speech generation that effectively captures realistic interactions involving multiple speakers, evolving emotions, and dynamic environmental contexts.
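The summary's "Global-Sentence-Token multi-level annotation scheme" can be pictured as a three-level structure: a global layer for scene and acoustic environment, a sentence layer for speaker, emotion, and overlap, and a token layer for word-level cues. The paper does not publish its schema, so the field names and types below are illustrative assumptions, not the authors' actual format:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TokenAnnotation:
    text: str
    cue: str | None = None  # e.g. an emphasis or laughter marker

@dataclass
class SentenceAnnotation:
    speaker: str
    emotion: str
    overlaps_previous: bool = False  # models interruptions/overlapping speech
    tokens: list[TokenAnnotation] = field(default_factory=list)

@dataclass
class GlobalAnnotation:
    scene: str
    acoustic_env: str
    sentences: list[SentenceAnnotation] = field(default_factory=list)

# A two-speaker exchange with an interruption, annotated top-down.
dialogue = GlobalAnnotation(
    scene="heated debate",
    acoustic_env="small echoey room",
    sentences=[
        SentenceAnnotation(
            speaker="A", emotion="frustrated",
            tokens=[TokenAnnotation("Listen,", cue="emphasis"),
                    TokenAnnotation("I never said that.")],
        ),
        SentenceAnnotation(
            speaker="B", emotion="defensive", overlaps_previous=True,
            tokens=[TokenAnnotation("But you implied it!")],
        ),
    ],
)
```

A directive like this is exactly the kind of "structured vocal directive" a front-end agent could emit: every level of the hierarchy becomes a machine-readable control the synthesis engine can act on.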

📝 Abstract
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
Problem

Research questions and friction points this paper is trying to address.

long speech synthesis
multi-speaker interaction
global context
paralinguistic cues
emotional arcs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Borderless Long Speech Synthesis
Global-Sentence-Token Annotation
Chain-of-Thought Reasoning
Native Agentic TTS
Structured Semantic Interface
👥 Authors
Xingchen Song, MiLM Plus, Xiaomi Inc., China
Di Wu, MiLM Plus, Xiaomi Inc., China
Dinghao Zhou, MiLM Plus, Xiaomi Inc., China
Pengyu Cheng, Alibaba Group
Hongwu Ding, MiLM Plus, Xiaomi Inc., China
Yunchao He, Xiaomi Inc.
Jie Wang, MiLM Plus, Xiaomi Inc., China
Shengfan Shen, MiLM Plus, Xiaomi Inc., China
Sixiang Lv, MiLM Plus, Xiaomi Inc., China
Lichun Fan, MiLM Plus, Xiaomi Inc., China
Hang Su, MiLM Plus, Xiaomi Inc., China
Yifeng Wang, MiLM Plus, Xiaomi Inc., China
Shuai Wang, Nanjing University
Meng Meng, Associate Professor, University of Bath
Jian Luan, Toshiba, Microsoft, Xiaomi