Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis

📅 2025-08-18
🤖 AI Summary
Problem: Sarcastic speech synthesis is difficult because sarcasm relies on subtle prosodic variation, annotated sarcastic speech data are scarce, and the pragmatic meaning of sarcasm is hard to model.
Method: This paper proposes a TTS training framework that integrates a feedback loss from a bi-modal sarcasm detector. A pre-trained TTS model is trained together with a text–speech sarcasm detector, whose output is incorporated as an auxiliary feedback loss. A two-stage fine-tuning strategy based on transfer learning adapts the model, pre-trained on read speech, first to a multi-style dataset that includes sarcastic speech and then to a sarcasm-specific dataset, enabling joint modeling of sarcastic semantics and prosody.
Contribution/Results: Experiments show significant improvements over baselines: +0.42 in naturalness (MOS), +0.38 in speech quality (CMOS), and +12.6% in sarcasm perception accuracy (human evaluation). The method makes sarcastic intent both more perceptible and more naturally expressed.
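The feedback-loss idea can be illustrated with a minimal sketch. It is not the paper's formulation; it assumes that the detector returns a probability `p_sarcastic` that a (text, synthesized speech) pair is sarcastic, that the feedback term is binary cross-entropy toward the "sarcastic" label, and that this term is added to the ordinary TTS loss with a hypothetical weight `feedback_weight`. Plain floats stand in for tensors; in a real training framework the same expression would be computed on tensors so that gradients reach the TTS model.

```python
import math


def feedback_loss(p_sarcastic: float, eps: float = 1e-7) -> float:
    """Assumed feedback term: binary cross-entropy toward the 'sarcastic' label.

    p_sarcastic is the bi-modal detector's probability that the synthesized
    speech, paired with its input text, is sarcastic.
    """
    p = min(max(p_sarcastic, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p)


def total_loss(tts_loss: float, p_sarcastic: float,
               feedback_weight: float = 0.1) -> float:
    """Usual TTS loss plus the weighted detector feedback (hypothetical weight)."""
    return tts_loss + feedback_weight * feedback_loss(p_sarcastic)


# Same TTS loss; the less convinced the detector, the larger the total loss.
print(round(total_loss(1.0, 0.9), 3))  # 1.011
print(round(total_loss(1.0, 0.2), 3))  # 1.161
```

A weighted sum is the simplest way to attach an auxiliary objective: the weight trades off reconstruction fidelity against how strongly the detector pushes the output toward prosody it recognizes as sarcastic.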

📝 Abstract
Sarcastic speech synthesis, which involves generating speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains a challenge due to the nuanced prosody that characterizes sarcasm, as well as the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process, enhancing the model's ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. First, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined using a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that our proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.
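The two-stage transfer described in the abstract amounts to running fine-tuning stages in sequence, each starting from the weights produced by the previous one. The sketch below is illustrative only: `Stage`, `SCHEDULE`, the dataset identifiers, and `toy_train` are hypothetical names, and `toy_train` merely records which datasets the "weights" have seen instead of performing gradient updates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, TypeVar

W = TypeVar("W")  # stand-in type for model weights


@dataclass
class Stage:
    name: str
    dataset: str  # hypothetical dataset identifier


# Hypothetical schedule following the abstract: after pre-training on read
# speech, fine-tune on multi-style data (incl. sarcasm), then on sarcasm only.
SCHEDULE: List[Stage] = [
    Stage("stage1_multi_style", "multi_style_incl_sarcasm"),
    Stage("stage2_sarcasm", "sarcasm_only"),
]


def fine_tune(weights: W, schedule: List[Stage],
              train_fn: Callable[[W, str], W]) -> Tuple[W, List[str]]:
    """Run the stages in order; each stage starts from the previous weights."""
    history: List[str] = []
    for stage in schedule:
        weights = train_fn(weights, stage.dataset)
        history.append(stage.name)
    return weights, history


def toy_train(weights: List[str], dataset: str) -> List[str]:
    """Toy stand-in for training: append the dataset name to the 'weights'."""
    return weights + [dataset]


pretrained = ["read_speech"]  # weights pre-trained on read speech
final, history = fine_tune(pretrained, SCHEDULE, toy_train)
print(final)    # ['read_speech', 'multi_style_incl_sarcasm', 'sarcasm_only']
print(history)  # ['stage1_multi_style', 'stage2_sarcasm']
```

Keeping the sarcasm-only stage last mirrors the abstract's ordering: the broader multi-style data comes first, and the scarce sarcastic data refines the model at the end.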

Problem

Challenges in synthesizing sarcastic speech due to nuanced prosody
Limited availability of annotated sarcastic speech data
Need for improved sarcasm-awareness in speech synthesis models

Innovation

Integrates a feedback loss from a bi-modal (text–speech) sarcasm detector into TTS training
Uses transfer learning with two-stage fine-tuning: multi-style speech first, then sarcastic speech
Enhances sarcasm-awareness in speech synthesis