🤖 AI Summary
Data-scarce speech processing tasks, such as speech enhancement, are prone to content and speaker distortions when relying on generative models. To address this, we propose SpeechOp, a multi-task latent diffusion framework that repurposes a pre-trained text-to-speech (TTS) model as a universal speech processor, enabling dynamic, inference-time composition of diverse tasks, including enhancement, denoising, and style transfer, without architectural modification. Crucially, we introduce automatic speech recognition (ASR) transcriptions (e.g., from Whisper) as implicit task-conditioning signals, tightly coupling generative capability with linguistic understanding. Our approach requires no task-specific fine-tuning and achieves state-of-the-art content preservation in speech enhancement, significantly improving both content and speaker fidelity. Moreover, adapting the TTS model in this way also improves its core synthesis quality. This work advances end-to-end, high-fidelity speech processing by unifying generation and understanding within a single, adaptable diffusion-based paradigm.
📝 Abstract
While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech (S2S) processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To bridge this gap, we present SpeechOp, a multi-task latent diffusion model that transforms pre-trained TTS models into a universal speech processor capable of performing a wide range of speech tasks and composing them in novel ways at inference time. By adapting a pre-trained TTS model, SpeechOp inherits a rich understanding of natural speech, accelerating training and improving S2S task quality, while simultaneously enhancing core TTS performance. Finally, we introduce Implicit Task Composition (ITC), a novel pipeline in which ASR-derived transcripts (e.g., from Whisper) guide SpeechOp's enhancement via our principled inference-time task composition. ITC achieves state-of-the-art content preservation by robustly combining web-scale speech understanding with SpeechOp's generative capabilities. Audio samples are available at https://justinlovelace.github.io/projects/speechop
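The abstract does not spell out SpeechOp's composition rule, but a common way to combine task conditions in a diffusion model at inference time is to mix task-conditioned noise estimates around an unconditional estimate, in the style of classifier-free guidance. The toy sketch below is purely illustrative: the denoiser, conditioning vectors, update rule, and guidance weights are all hypothetical stand-ins, not SpeechOp's actual architecture. It only shows how an "enhance" condition and an ASR-transcript condition (as in ITC) could be weighted together in one sampling loop.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16  # hypothetical latent size, chosen for illustration

def toy_denoiser(latent, t, cond):
    """Stand-in for the diffusion network's noise prediction.
    `cond` is a task-conditioning vector (None = unconditional).
    The linear dynamics here are a placeholder, not a real model."""
    bias = 0.0 if cond is None else 0.1 * cond
    return 0.5 * latent + bias

def composed_eps(latent, t, task_conds, weights):
    """Compose several task conditions at inference time
    (classifier-free-guidance-style mixing; an assumption, not
    necessarily SpeechOp's exact rule)."""
    eps_uncond = toy_denoiser(latent, t, None)
    eps = eps_uncond.copy()
    for cond, w in zip(task_conds, weights):
        eps += w * (toy_denoiser(latent, t, cond) - eps_uncond)
    return eps

# Combine an "enhance" condition with an ASR-transcript condition (ITC).
# In practice the transcript embedding would come from e.g. Whisper text;
# here both conditions are random vectors for the sake of a runnable demo.
enhance_cond = rng.standard_normal(LATENT_DIM)
transcript_cond = rng.standard_normal(LATENT_DIM)

latent = rng.standard_normal(LATENT_DIM)
for t in reversed(range(10)):  # tiny, simplified sampling loop
    eps = composed_eps(latent, t, [enhance_cond, transcript_cond], [1.5, 1.0])
    latent = latent - 0.1 * eps  # simplistic update, not a real DDIM step

print(latent.shape)  # (16,)
```

The key design point is that each task's influence is an additive offset from the unconditional estimate, so tasks can be added, removed, or reweighted per sampling run with no architectural change, which matches the abstract's claim of composing tasks "in novel ways at inference time."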