🤖 AI Summary
Sarcastic speech synthesis faces two challenges that existing TTS systems have largely left unexplored: modeling semantic incongruity and modeling prosodic cues. To address them, we propose an end-to-end sarcasm-aware TTS framework, presented as the first of its kind. A LoRA-fine-tuned LLaMA 3 extracts pragmatic embeddings that capture semantic contradictions, a RAG module retrieves context-matched sarcastic prosodic exemplars, and the two signals jointly condition a VITS-based acoustic model to produce natural, contextually appropriate sarcastic intonation. The approach combines LLM fine-tuning with retrieval-augmented generation for sarcasm TTS and removes the reliance on explicit prosodic annotations. Reported gains over baselines are +0.42 MOS (naturalness), +23.6% in subjective sarcasm expressiveness, and +11.8% accuracy on downstream sarcasm detection, supporting both perceptual quality and functional utility.
📝 Abstract
Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced, retrieval-augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval-Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.
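To make the dual-conditioning idea concrete, the following is a minimal sketch of how a retrieved prosodic exemplar might be combined with a semantic embedding. The abstract does not say how retrieval or fusion is implemented, so everything here is an assumption for illustration: cosine similarity as the retrieval score, averaging the top-k exemplars, concatenation as the fusion step, the toy two-dimensional vectors, and the hypothetical names `retrieve_prosody` and `dual_condition`. In the paper's setting the semantic vector would come from the LoRA-fine-tuned LLaMA 3 and the result would condition the VITS backbone.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_prosody(
    query: Sequence[float],
    bank: Sequence[Tuple[Vector, Vector]],
    k: int = 2,
) -> Vector:
    """Hypothetical retrieval step (an assumption, not the paper's method).

    Each bank entry is (semantic_key, prosody_embedding). The k entries whose
    keys are most similar to the query are selected and their prosody
    embeddings are averaged element-wise.
    """
    if not bank:
        raise ValueError("exemplar bank is empty")
    ranked = sorted(bank, key=lambda entry: cosine(query, entry[0]), reverse=True)[:k]
    dim = len(ranked[0][1])
    return [sum(prosody[i] for _, prosody in ranked) / len(ranked) for i in range(dim)]


def dual_condition(semantic: Sequence[float], prosody: Sequence[float]) -> Vector:
    """Hypothetical fusion step: concatenate the two conditioning paths."""
    return list(semantic) + list(prosody)


# Toy exemplar bank: (semantic key, prosody embedding). Values are made up.
bank = [
    ([1.0, 0.0], [0.2, 0.8]),
    ([0.9, 0.1], [0.4, 0.6]),
    ([0.0, 1.0], [0.9, 0.1]),
]
query = [1.0, 0.05]  # stand-in for an LLM-derived semantic embedding

prosody = retrieve_prosody(query, bank, k=2)
conditioning = dual_condition(query, prosody)
```

Concatenation is only one plausible way to fuse the two paths; cross-attention or additive conditioning would fit the same description, and the abstract does not indicate which the authors use.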