Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis

📅 2025-08-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Sarcastic speech synthesis is difficult because sarcasm is carried by subtle prosodic variations, annotated data is scarce, and pragmatic intent is hard to model. Method: This paper proposes a TTS training framework that integrates a feedback loss from a bi-modal sarcasm detector. It trains a pre-trained TTS model together with a text–speech sarcasm detector, using the detector’s output as an auxiliary feedback loss. A two-stage fine-tuning strategy, combined with transfer learning across speech styles, lets the model learn sarcastic semantics and prosody together from a multi-style dataset that includes sarcastic speech. Contribution/Results: Experiments report improvements over baselines of +0.42 in naturalness (MOS), +0.38 in speech quality (CMOS), and +12.6% in sarcasm perception accuracy (human evaluation). The method makes sarcastic intent both easier to perceive and more naturally expressed.

📝 Abstract

Sarcastic speech synthesis, which involves generating speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains a challenge due to the nuanced prosody that characterizes sarcasm, as well as the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process, enhancing the model's ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. First, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined using a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that our proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.
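The feedback-loss idea in the abstract can be sketched as a weighted sum of the usual TTS loss and a penalty derived from the sarcasm detector's output. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the binary cross-entropy form, the `feedback_weight` value, and the function names are hypothetical, and plain floats stand in for tensors.

```python
import math


def binary_cross_entropy(prob: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted probability (assumed loss form)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def total_loss(
    tts_loss: float,
    detector_sarcasm_prob: float,
    is_sarcastic: bool,
    feedback_weight: float = 0.1,  # hypothetical weight, not taken from the paper
) -> float:
    """Combine the TTS loss with a detector-based feedback term.

    detector_sarcasm_prob is the probability that the bi-modal detector
    assigns to "sarcastic" given the input text and the synthesized speech.
    """
    target = 1.0 if is_sarcastic else 0.0
    feedback = binary_cross_entropy(detector_sarcasm_prob, target)
    return tts_loss + feedback_weight * feedback


# For a sarcastic utterance, speech that the detector finds more sarcastic
# yields a lower total loss than speech it finds less sarcastic.
convincing = total_loss(tts_loss=1.0, detector_sarcasm_prob=0.9, is_sarcastic=True)
flat = total_loss(tts_loss=1.0, detector_sarcasm_prob=0.2, is_sarcastic=True)
```

Under these assumptions, the feedback term pushes the synthesizer toward outputs that the detector labels consistently with the intended style, while `feedback_weight` controls how strongly that pressure competes with the reconstruction objective.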

Problem

Research questions and friction points this paper is trying to address.

Challenges in synthesizing sarcastic speech due to nuanced prosody

Limited availability of annotated sarcastic speech data

Need for improved sarcasm-awareness in speech synthesis models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates feedback loss from bi-modal detector

Uses two-stage fine-tuning with transfer learning

Enhances sarcasm-awareness in speech synthesis
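
The two-stage fine-tuning described above can be written down as a small schedule in which each stage starts from the checkpoint produced by the previous one. This is only a sketch of the ordering: the stage names, dataset labels, and checkpoint identifier are hypothetical placeholders, not names from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str       # label for this fine-tuning stage (hypothetical)
    dataset: str    # placeholder description of the training data
    init_from: str  # checkpoint or stage the weights are initialized from


def build_schedule(pretrained: str = "read_speech_checkpoint") -> list:
    """Return the stages in training order, chained by initialization."""
    # Stage 1: adapt the read-speech model to diverse styles, sarcasm included.
    stage1 = Stage("stage1_multi_style", "multi_style_incl_sarcasm", pretrained)
    # Stage 2: refine the stage-1 model on sarcastic speech only.
    stage2 = Stage("stage2_sarcasm_only", "sarcastic_speech", stage1.name)
    return [stage1, stage2]


schedule = build_schedule()
for stage in schedule:
    print(f"{stage.name}: train on {stage.dataset}, init from {stage.init_from}")
```

The point of the chaining is that the scarce sarcastic data is seen last, after the model has already been exposed to a broader range of speaking styles.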

🔎 Similar Papers

InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection