Synthetic Audio Helps for Cognitive State Tasks

📅 2025-02-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Cognitive state recognition in NLP has traditionally relied on textual input alone, neglecting paralinguistic cues such as prosody. Method: This paper introduces Synthetic Audio Data fine-tuning (SAD), a multimodal framework that uses zero-shot synthetic speech, generated by an off-the-shelf text-to-speech (TTS) system, as a source of acoustic signal that is informative about cognitive state and complementary to the text. SAD employs multimodal fine-tuning and cross-modal feature fusion to jointly train classification/regression models on both text and synthetic audio. Contribution/Results: Evaluated on seven cognitive state prediction tasks, SAD improves over text-only baselines. On tasks with ground-truth audio, SAD with text plus zero-shot TTS audio is competitive with models that use text plus real audio. The approach offers a practical, resource-efficient route to multimodal cognitive modeling, particularly for corpora that lack recorded audio or where collecting speech raises privacy concerns.

📝 Abstract

The NLP community has broadly focused on text-only approaches to cognitive state tasks, but audio can provide vital missing cues through prosody. We posit that text-to-speech models learn to track aspects of cognitive state in order to produce naturalistic audio, and that the signal audio models implicitly identify is orthogonal to the information that language models exploit. We present Synthetic Audio Data fine-tuning (SAD), a framework where we show that 7 tasks related to cognitive state modeling benefit from multimodal training on both text and zero-shot synthetic audio data from an off-the-shelf TTS system. We show an improvement over the text-only modality when adding synthetic audio data to text-only corpora. Furthermore, on tasks and corpora that do contain gold audio, we show our SAD framework achieves competitive performance with text and synthetic audio compared to text and gold audio.
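The abstract does not say how the text and audio representations are combined, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes simple concatenation of a text feature vector and an audio feature vector, followed by logistic regression. All names (`fuse`, `train`, `accuracy`), the feature values, and the labels are hypothetical; the data is constructed so that the label depends only on the audio feature, which illustrates how a second, complementary modality can help. It says nothing about the paper's actual results.

```python
import math


def fuse(text_vec, audio_vec):
    """Concatenate the two modality vectors (an assumed fusion scheme)."""
    return list(text_vec) + list(audio_vec)


def predict(weights, bias, features):
    """Logistic regression probability for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, dim, epochs=200, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def accuracy(weights, bias, examples):
    """Fraction of examples whose rounded probability matches the label."""
    hits = sum(round(predict(weights, bias, f)) == y for f, y in examples)
    return hits / len(examples)


# Hypothetical toy data: (text features, synthetic-audio features, label).
# Pairs share identical text features but differ in the audio feature,
# so the text alone cannot separate the two labels.
toy = [
    ([0.2, 0.1], [0.9], 1),
    ([0.2, 0.1], [0.1], 0),
    ([0.7, 0.3], [0.8], 1),
    ([0.7, 0.3], [0.2], 0),
]

text_only = [(list(t), y) for t, _, y in toy]
fused = [(fuse(t, a), y) for t, a, y in toy]

text_acc = accuracy(*train(text_only, dim=2), text_only)
fused_acc = accuracy(*train(fused, dim=3), fused)
print(f"text-only accuracy: {text_acc:.2f}, text+audio accuracy: {fused_acc:.2f}")
```

In a real pipeline the hand-written vectors would be replaced by embeddings from a language model and from an audio encoder run over TTS output; which encoders and which fusion layer to use are choices this sketch leaves open.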

Problem

Research questions and friction points this paper is trying to address.

Enhance cognitive state tasks with synthetic audio.

Combine text and synthetic audio for better results.

Improve performance over text-only cognitive modeling.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic Audio Data fine-tuning

Multimodal text-audio training

Off-the-shelf TTS system
