SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation

📅 2025-04-07
🤖 AI Summary
Problem: Novice content creators find speech synthesis inefficient because text-to-speech (TTS) interfaces are unintuitive or overly granular and offer little high-level control over expressive delivery. Method: SpeakEasy is a Wizard-of-Oz TTS system for beginners that replaces low-level parameter tuning with high-level context supplied alongside the script (e.g., emotion, pacing, attitude) and supports iterative refinement through high-level user feedback. Its design was informed by two eight-participant formative studies: one on content creators' experiences with TTS and one on strategies used by voice actors. Contribution/Results: In the evaluation, participants using SpeakEasy were more successful at generating speech that met their personal standards than with leading industry TTS interfaces, without requiring significantly more effort. The work suggests a design approach for more accessible, expressive speech synthesis.

📝 Abstract
Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many creators struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.
Problem

Research questions and friction points this paper is trying to address.

Simplifying TTS interfaces for novice content creators
Enhancing expressive speech generation with high-level context
Reducing effort in achieving personalized TTS performances
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simplifying TTS with high-level context input
Leveraging user context to influence TTS output
Enabling iterative refinement with high-level feedback
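
The three ideas above (a script paired with high-level context, then refined through high-level feedback) can be pictured as a small data structure. The following is only an illustrative sketch, not the SpeakEasy implementation: the class name `SpeechRequest`, its fields, and the `refine` and `direction` methods are all hypothetical, and the sketch makes no TTS call. It only shows how context and feedback rounds might accumulate into a plain-language direction for whoever, or whatever, produces the audio.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechRequest:
    """A script plus the high-level context a creator supplies (hypothetical structure)."""
    script: str
    context: str
    feedback: list[str] = field(default_factory=list)

    def refine(self, note: str) -> "SpeechRequest":
        """Return a new request that carries one more round of high-level feedback."""
        return SpeechRequest(self.script, self.context, [*self.feedback, note])

    def direction(self) -> str:
        """Combine context and accumulated feedback into one plain-text direction."""
        lines = [f"Context: {self.context}"]
        lines += [f"Revision {i}: {note}" for i, note in enumerate(self.feedback, start=1)]
        return "\n".join(lines)


# Example values are invented for illustration.
first = SpeechRequest(
    script="Welcome back to the channel!",
    context="Upbeat intro for a cooking video",
)
second = first.refine("Slow down a little and sound warmer")
print(second.direction())
```

Each call to `refine` returns a new request rather than mutating the old one, so earlier attempts remain available for comparison during iteration.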