Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Sarcastic speech synthesis faces two challenges, modeling semantic incongruity and modeling prosodic cues, both largely unexplored in existing TTS systems. To address this, we propose a sarcasm-aware TTS framework, described as the first of its kind and end-to-end. It uses a LoRA-fine-tuned LLaMA 3 to extract pragmatic embeddings that capture semantic contradictions, adds a RAG module to retrieve context-matched sarcastic prosodic exemplars, and conditions a VITS-based acoustic model on both signals to produce natural, contextually appropriate sarcastic intonation. Our approach is the first to combine LLM fine-tuning and retrieval-augmented generation for sarcastic TTS, and it removes the reliance on explicit prosodic annotations. Experiments show significant improvements over baselines: +0.42 MOS (naturalness), +23.6% on the subjective sarcasm expressiveness score, and +11.8% accuracy on downstream sarcasm detection, supporting both perceptual quality and functional utility.

📝 Abstract

Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.
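The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes each stored sarcastic utterance is indexed by a text embedding and that exemplars are ranked by cosine similarity to the query; the abstract does not state the similarity measure or index actually used, and the names `retrieve_exemplars`, `bank`, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_exemplars(
    query: Sequence[float],
    bank: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the ids of the k bank entries most similar to the query embedding."""
    scored = [(cosine_similarity(query, emb), ex_id) for ex_id, emb in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex_id for _, ex_id in scored[:k]]


# Hypothetical toy bank: (utterance id, embedding of its transcript).
bank = [
    ("utt_a", [1.0, 0.0, 0.0]),
    ("utt_b", [0.9, 0.1, 0.0]),
    ("utt_c", [0.0, 1.0, 0.0]),
]
query = [1.0, 0.0, 0.1]  # stand-in for the embedding of the input text
top = retrieve_exemplars(query, bank, k=2)
print(top)
```

In a full system the retrieved ids would point to reference audio or prosody features for the matching sarcastic utterances, which then serve as the expressive reference patterns.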

Problem

Research questions and friction points this paper is trying to address.

Synthesizing sarcastic speech using nuanced semantic and prosodic cues

Addressing sarcasm's reliance on contextual and discourse-level information

Overcoming limitations of broad emotional categories in speech synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-enhanced semantic embeddings capture sarcastic cues

Retrieval-augmented generation provides prosodic exemplars

VITS backbone integrates dual conditioning for synthesis
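As a rough illustration of dual conditioning, the sketch below builds one conditioning vector by concatenating a semantic embedding with the mean of the retrieved prosody embeddings. This fusion rule is an assumption chosen for simplicity, not the paper's documented mechanism, and `mean_pool`, `build_conditioning`, and the example values are hypothetical.

```python
from typing import List, Sequence


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_conditioning(
    semantic: Sequence[float],
    prosody_exemplars: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate the semantic embedding with the pooled prosody embedding."""
    return list(semantic) + mean_pool(prosody_exemplars)


# Hypothetical inputs: a 2-d semantic embedding and two 3-d prosody embeddings.
semantic = [0.2, -0.1]
exemplars = [[1.0, 3.0, 5.0], [3.0, 5.0, 7.0]]
cond = build_conditioning(semantic, exemplars)
print(cond)  # [0.2, -0.1, 2.0, 4.0, 6.0]
```

A vector like `cond` would then be passed to the acoustic model as a global conditioning input; a learned projection or attention over the exemplars is an equally plausible design.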

🔎 Similar Papers

Reading with Intent