DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice

📅 2026-01-22
🤖 AI Summary
This work addresses the challenge of generating high-fidelity, relaxing ASMR speech for arbitrary speakers under zero-shot conditions — a task that existing text-to-speech systems struggle to accomplish. To this end, we propose DeepASMR, a novel framework capable of synthesizing a target speaker's ASMR voice from only a single segment of ordinary read speech, without requiring any whisper or ASMR training data from that speaker. Our approach achieves the first zero-shot ASMR voice generation by softly decoupling ASMR style from speaker timbre through discrete speech tokens. We also introduce a large-scale multilingual ASMR corpus and a comprehensive evaluation protocol. The system employs a two-stage architecture: a large language model-based content-style encoder, followed by a flow-matching-based acoustic decoder for timbre reconstruction. DeepASMR attains state-of-the-art performance in both ASMR naturalness and style fidelity while maintaining strong capabilities on conventional speech synthesis tasks.

📝 Abstract
While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring, and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.
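The abstract's two-stage design can be sketched as follows. This is an illustrative skeleton only, assuming a discrete speech-token interface between the stages; all function names, codebook size, token rate, and shapes are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

VOCAB = 1024         # assumed size of the discrete speech-token codebook
SAMPLE_RATE = 24000  # assumed output sample rate
TOKENS_PER_SEC = 50  # assumed token rate of the acoustic decoder

def stage1_llm_encoder(text: str, style: str = "asmr") -> np.ndarray:
    """Stage 1 (stand-in): an LLM maps text plus a style tag to discrete
    speech tokens carrying content and ASMR style, but little speaker timbre."""
    rng = np.random.default_rng(abs(hash((text, style))) % (2**32))
    n_tokens = 4 * max(len(text.split()), 1)  # rough tokens-per-word guess
    return rng.integers(0, VOCAB, size=n_tokens)

def stage2_flow_decoder(tokens: np.ndarray, ref_speech: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in): a flow-matching decoder renders the tokens into a
    waveform, recovering the target speaker's timbre from a short reference
    clip of ordinary read speech."""
    duration = len(tokens) * SAMPLE_RATE // TOKENS_PER_SEC
    return np.zeros(duration, dtype=np.float32)  # silent placeholder audio

# Zero-shot use: only ordinary read speech from the target speaker is needed.
ref = np.zeros(SAMPLE_RATE * 3, dtype=np.float32)  # 3 s reference clip
tokens = stage1_llm_encoder("please relax and close your eyes")
audio = stage2_flow_decoder(tokens, ref)
```

The key property the sketch highlights is the interface: because the stage-1 tokens only softly encode timbre, the style transfer (read speech in, ASMR out) can happen in stage 1, while stage 2 restores the speaker's identity from the reference clip.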
Problem

Research questions and friction points this paper is trying to address.

ASMR
zero-shot
text-to-speech
voice adaptation
low-intensity speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot ASMR generation
Discrete speech tokens
Large Language Model (LLM)
Flow-matching acoustic decoder
Speaker timbre disentanglement
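For the flow-matching acoustic decoder listed above, the standard conditional flow-matching training target can be written in a few lines. This is a generic sketch of that objective, not the paper's exact formulation: a velocity network is regressed onto the constant velocity of the straight path between a noise sample and a data sample (the predictor here is an untrained zero stand-in).

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(80)   # noise sample (e.g. one mel-spectrogram frame)
x1 = rng.standard_normal(80)   # data sample (target acoustic features)
t = rng.uniform()              # random time in [0, 1]

x_t = (1.0 - t) * x0 + t * x1  # point on the straight interpolation path
v_target = x1 - x0             # constant target velocity along that path

v_pred = np.zeros_like(x_t)    # untrained velocity predictor (stand-in)
loss = np.mean((v_pred - v_target) ** 2)  # MSE regression loss
```

At inference, the trained velocity field is integrated from t = 0 to t = 1 (e.g. with a few Euler steps) to map noise to acoustic features, conditioned here on the discrete tokens and the reference speaker clip.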
Leying Zhang
Auditory Cognition and Computational Acoustics Lab, School of Computer Science & MoE Key Laboratory of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, 200240 P. R. China
Tingxiao Zhou
Auditory Cognition and Computational Acoustics Lab, School of Computer Science & MoE Key Laboratory of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, 200240 P. R. China
Haiyang Sun
Shanghai Jiao Tong University
Mengxiao Bi
Fuxi AI Lab, NetEase Inc.
Yanmin Qian
Professor, Shanghai Jiao Tong University
Speech and Language Processing · Signal Processing · Machine Learning