🤖 AI Summary
This work addresses the challenge of generating high-fidelity, relaxing ASMR speech for arbitrary speakers under zero-shot conditions—a task that existing text-to-speech systems struggle to accomplish. To this end, we propose DeepASMR, a novel framework capable of synthesizing ASMR speech in a target speaker's voice from only a single segment of that speaker's ordinary read speech, without requiring any whispered or ASMR training data from them. Our approach achieves the first zero-shot ASMR voice generation by softly decoupling ASMR style from speaker timbre through discrete speech tokens. We also introduce a large-scale multilingual ASMR corpus and a comprehensive evaluation protocol. The system employs a two-stage architecture: a large language model–based content-style encoder and a flow-matching–based acoustic decoder for timbre reconstruction. DeepASMR attains state-of-the-art performance in both ASMR naturalness and style fidelity while maintaining strong performance on conventional speech synthesis tasks.
📝 Abstract
While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style used to induce relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring, and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for arbitrary unseen voices, while maintaining competitive performance on normal speech synthesis.
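The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. All names and signatures below are hypothetical stand-ins, not the authors' actual API: a toy "LLM" encoder maps text to discrete tokens that carry content and ASMR style (but, per the soft factorization, little timbre), and a toy "flow-matching" decoder reconstructs audio frames while re-injecting timbre from a short read-speech prompt.

```python
# Hypothetical sketch of DeepASMR's two-stage design; function names,
# token scheme, and decoding math are illustrative placeholders only.

def content_style_encode(text: str, style: str = "asmr") -> list[int]:
    """Stage 1 (stand-in for the LLM content-style encoder):
    map text to discrete speech tokens conditioned on a style tag.
    The token stream encodes content + style, largely free of timbre."""
    style_offset = {"read": 0, "asmr": 1000}[style]  # toy style conditioning
    return [style_offset + ord(ch) for ch in text]

def acoustic_decode(tokens: list[int], speaker_prompt: list[float]) -> list[float]:
    """Stage 2 (stand-in for the flow-matching acoustic decoder):
    turn tokens into audio frames, restoring timbre taken from a
    short snippet of the target speaker's ordinary read speech."""
    timbre = sum(speaker_prompt) / len(speaker_prompt)  # crude timbre summary
    return [tok * 1e-3 + timbre for tok in tokens]

def deep_asmr_tts(text: str, speaker_prompt: list[float]) -> list[float]:
    """End-to-end zero-shot ASMR synthesis: encode, then decode."""
    tokens = content_style_encode(text, style="asmr")
    return acoustic_decode(tokens, speaker_prompt)
```

The key property the sketch mirrors is that speaker identity enters only in the second stage, so the same token stream can be rendered in any voice given a read-speech prompt.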