🤖 AI Summary
This paper introduces expressive speech retrieval: a task that retrieves speech segments matching natural language descriptions of speaking style (e.g., “excitedly”, “wearily”) rather than of semantic content. Methodologically, it builds a joint speech–text latent space with a dual-encoder architecture trained via contrastive learning for cross-modal alignment, combined with a prompt-augmentation strategy to improve generalization to free-form style queries. Key contributions include: (1) formalizing and modeling fine-grained “how it was said” retrieval; (2) enabling free-text style queries without a constrained vocabulary; and (3) a scalable, prompt-based augmentation mechanism. Evaluated across multiple datasets covering 22 distinct speaking styles, the approach achieves strong Recall@k performance, demonstrating accuracy and robustness in expressive speech retrieval.
📝 Abstract
We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k.
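The contrastive cross-modal objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the symmetric InfoNCE (CLIP-style) loss shown here, the temperature value, and the assumption that matched speech–text pairs share a batch index are all assumptions for the sketch; the actual encoder architectures are out of scope.

```python
import numpy as np

def info_nce_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    speech_emb, text_emb: (B, D) arrays where row i of each is a matched
    speech/style-description pair. Returns a scalar loss that is low when
    matched pairs are more similar than mismatched ones.
    """
    # L2-normalize so the dot product is cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (B, B); matched pairs on the diagonal
    labels = np.arange(len(s))

    def xent(l):
        # Cross-entropy toward the diagonal, with max-subtraction for stability.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the speech->text and text->speech directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

At retrieval time, a free-form text query (e.g., “spoken wearily”) is embedded by the text encoder and speech segments are ranked by cosine similarity in the shared space; Recall@k then measures how often a matching segment appears in the top k.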