🤖 AI Summary
Sarcasm speech synthesis faces challenges including subtle prosodic variations, scarcity of annotated data, and difficulty in pragmatic modeling. Method: This paper proposes a TTS training framework integrating dual-modal sarcasm detection feedback loss. It jointly optimizes a pre-trained TTS model with a text–speech multimodal sarcasm detector, incorporating the detector’s output as an auxiliary feedback loss. A two-stage fine-tuning strategy, combined with cross-style transfer learning, enables co-modeling of sarcastic semantics and prosody on a multi-style sarcastic speech dataset. Contribution/Results: Experiments demonstrate significant improvements over baselines: +0.42 in naturalness (MOS), +0.38 in speech quality (CMOS), and +12.6% in sarcasm perception accuracy (human evaluation). The method effectively enhances both audibility and natural expressivity of sarcastic intent.
📝 Abstract
Sarcastic speech synthesis, which involves generating speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains a challenge due to the nuanced prosody that characterizes sarcasm, as well as the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process, enhancing the model's ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. First, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined using a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that our proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.