🤖 AI Summary
This paper introduces expressive speech retrieval: a task that retrieves speech segments matching natural language descriptions of speaking style (e.g., “excitedly”, “wearily”) rather than of semantic content. Methodologically, it builds a joint speech–text latent space with a dual-encoder architecture trained via contrastive learning for cross-modal alignment, combined with a prompt-augmentation strategy to improve generalization to free-form style queries. Key contributions include: (1) formalizing and modeling fine-grained “how it was said” retrieval; (2) enabling free-text style queries without a constrained vocabulary; and (3) a scalable, prompt-based augmentation mechanism. Evaluated across multiple datasets covering 22 distinct speaking styles, the approach achieves strong Recall@k performance, demonstrating accuracy and robustness in expressive speech retrieval.
📝 Abstract
We introduce the task of expressive speech retrieval, where the goal is to retrieve speech utterances spoken in a given style based on a natural language description of that style. While prior work has primarily focused on performing speech retrieval based on what was said in an utterance, we aim to do so based on how something was said. We train speech and text encoders to embed speech and text descriptions of speaking styles into a joint latent space, which enables using free-form text prompts describing emotions or styles as queries to retrieve matching expressive speech segments. We perform detailed analyses of various aspects of our proposed framework, including encoder architectures, training criteria for effective cross-modal alignment, and prompt augmentation for improved generalization to arbitrary text queries. Experiments on multiple datasets encompassing 22 speaking styles demonstrate that our approach achieves strong retrieval performance as measured by Recall@k.
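The contrastive cross-modal objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the symmetric InfoNCE (CLIP-style) loss shown here, the temperature value, and the assumption that matched speech–text pairs share a batch index are all assumptions for the sketch; the actual encoder architectures are out of scope.

```python
import numpy as np

def info_nce_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    speech_emb, text_emb: (B, D) arrays where row i of each is a matched
    speech/style-description pair. Returns a scalar loss that is low when
    matched pairs are more similar than mismatched ones.
    """
    # L2-normalize so the dot product is cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (B, B); matched pairs on the diagonal
    labels = np.arange(len(s))

    def xent(l):
        # Cross-entropy toward the diagonal, with max-subtraction for stability.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the speech->text and text->speech directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

At retrieval time, a free-form text query (e.g., “spoken wearily”) is embedded by the text encoder and speech segments are ranked by cosine similarity in the shared space; Recall@k then measures how often a matching segment appears in the top k.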