🤖 AI Summary
This work addresses the challenge of generating high-fidelity, relaxing ASMR speech for arbitrary speakers under zero-shot conditions—a task that existing text-to-speech systems struggle to accomplish. To this end, we propose DeepASMR, a novel framework capable of synthesizing ASMR speech in a target speaker's voice from only a single segment of that speaker's ordinary read speech, without requiring any whispered or ASMR training data from them. Our approach achieves the first zero-shot ASMR voice generation by softly decoupling ASMR style from speaker timbre through discrete speech tokens. We also introduce a large-scale multilingual ASMR corpus and a comprehensive evaluation protocol. The system employs a two-stage architecture: a large language model–based content-style encoder and a flow-matching–based acoustic decoder for timbre reconstruction. DeepASMR attains state-of-the-art performance in both ASMR naturalness and style fidelity while maintaining strong performance on conventional speech synthesis tasks.
📝 Abstract
While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style used to induce relaxation. The inherent challenges include ASMR's subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker's ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring, and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for arbitrary unseen voices, while maintaining competitive performance on normal speech synthesis.
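The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. All names and signatures below are hypothetical stand-ins, not the authors' actual API: a toy "LLM" encoder maps text to discrete tokens that carry content and ASMR style (but, per the soft factorization, little timbre), and a toy "flow-matching" decoder reconstructs audio frames while re-injecting timbre from a short read-speech prompt.

```python
# Hypothetical sketch of DeepASMR's two-stage design; function names,
# token scheme, and decoding math are illustrative placeholders only.

def content_style_encode(text: str, style: str = "asmr") -> list[int]:
    """Stage 1 (stand-in for the LLM content-style encoder):
    map text to discrete speech tokens conditioned on a style tag.
    The token stream encodes content + style, largely free of timbre."""
    style_offset = {"read": 0, "asmr": 1000}[style]  # toy style conditioning
    return [style_offset + ord(ch) for ch in text]

def acoustic_decode(tokens: list[int], speaker_prompt: list[float]) -> list[float]:
    """Stage 2 (stand-in for the flow-matching acoustic decoder):
    turn tokens into audio frames, restoring timbre taken from a
    short snippet of the target speaker's ordinary read speech."""
    timbre = sum(speaker_prompt) / len(speaker_prompt)  # crude timbre summary
    return [tok * 1e-3 + timbre for tok in tokens]

def deep_asmr_tts(text: str, speaker_prompt: list[float]) -> list[float]:
    """End-to-end zero-shot ASMR synthesis: encode, then decode."""
    tokens = content_style_encode(text, style="asmr")
    return acoustic_decode(tokens, speaker_prompt)
```

The key property the sketch mirrors is that speaker identity enters only in the second stage, so the same token stream can be rendered in any voice given a read-speech prompt.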