Aligning Generative Speech Enhancement with Human Preferences via Direct Preference Optimization

📅 2025-07-14
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study addresses the misalignment between conventional optimization objectives and human auditory preferences in generative speech enhancement. We propose a preference-aligned method built on language-model architectures, introducing Direct Preference Optimization (DPO) to speech enhancement for the first time. To enable scalable, annotation-free preference supervision, we adopt UTMOS, a neural MOS predictor, as a proxy for human perceptual feedback. Unlike standard mean-squared-error or likelihood-based training paradigms, our approach directly optimizes for human subjective quality preferences. Experiments on the DNS2020 test sets demonstrate that the proposed method achieves up to a 56% relative improvement over baseline models on objective metrics, including PESQ, STOI, and ViSQOL, while markedly enhancing speech naturalness and intelligibility. These results validate the effectiveness and generalizability of perception-driven optimization in generative speech enhancement.
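For readers unfamiliar with DPO, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that the summary refers to, assuming each argument is the summed token log-probability of an enhanced-speech sequence under the trainable policy or a frozen reference model. The function name and the beta value are illustrative, not the paper's exact implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the trainable policy vs. the frozen reference
    # for the preferred ("chosen") and dispreferred ("rejected") outputs.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the preferred output's log-ratio above the dispreferred one's;
    # beta (illustrative value) controls how far the policy may drift
    # from the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```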

📝 Abstract
This work investigates speech enhancement (SE) from the perspective of language models (LMs). We propose a novel method that leverages Direct Preference Optimization (DPO) to improve the perceptual quality of enhanced speech. Using UTMOS, a neural MOS prediction model, as a proxy for human ratings, our approach guides optimization toward perceptually preferred outputs. This differs from existing LM-based SE methods that focus on maximizing the likelihood of clean speech tokens, which may misalign with human perception and degrade quality despite low prediction error. Experiments on the 2020 Deep Noise Suppression Challenge test sets demonstrate that applying DPO to a pretrained LM-based SE model yields consistent improvements across various speech quality metrics, with relative gains of up to 56%. To our knowledge, this is the first application of DPO to SE and the first to incorporate proxy perceptual feedback into LM-based SE training, pointing to a promising direction for perceptually aligned SE.
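The abstract's key mechanism, using UTMOS as an annotation-free proxy for human ratings, can be sketched as follows. Here `se_model.generate` and `utmos.score` are hypothetical interfaces standing in for the LM-based SE model's sampler and a MOS-prediction model; the paper does not specify the number of candidates or the decoding settings, so these are assumptions.

```python
import torch

@torch.no_grad()
def build_preference_pair(se_model, utmos, noisy_tokens, n_candidates=4):
    # Sample several candidate enhancements for the same noisy input
    # (hypothetical sampler interface; decoding settings are assumptions).
    candidates = [se_model.generate(noisy_tokens, do_sample=True)
                  for _ in range(n_candidates)]
    # Score each candidate with the UTMOS proxy for human MOS ratings.
    scores = [utmos.score(c) for c in candidates]
    # Highest-scoring candidate is "chosen", lowest is "rejected".
    chosen = candidates[scores.index(max(scores))]
    rejected = candidates[scores.index(min(scores))]
    return chosen, rejected
```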
Problem

Research questions and friction points this paper is trying to address.

Aligns speech enhancement with human perceptual preferences
Uses DPO to improve quality over likelihood-based LM methods
Incorporates proxy feedback (UTMOS) for perceptual optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Direct Preference Optimization for speech enhancement
Leverages UTMOS as proxy for human perceptual ratings
Applies DPO to a pretrained LM-based SE model (a combined sketch follows this list)
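Putting the two earlier sketches together, a single DPO fine-tuning step over a pretrained LM-based SE policy might look like the following. `sequence_logprob` is a placeholder for scoring a token sequence under a model, and the optimizer usage reflects generic PyTorch practice rather than the paper's reported settings.

```python
import torch

def dpo_step(policy, reference, optimizer,
             noisy_tokens, chosen, rejected, beta=0.1):
    # Sequence log-probabilities under the trainable policy
    # (placeholder method; not the paper's actual API).
    pol_c = policy.sequence_logprob(noisy_tokens, chosen)
    pol_r = policy.sequence_logprob(noisy_tokens, rejected)
    # The reference model stays frozen throughout fine-tuning.
    with torch.no_grad():
        ref_c = reference.sequence_logprob(noisy_tokens, chosen)
        ref_r = reference.sequence_logprob(noisy_tokens, rejected)
    loss = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta)  # from the earlier sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```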
👥 Authors

Haoyang Li
Nanyang Technological University, Singapore

Nana Hou
ZOOM | Ph.D. at Nanyang Technological University, Singapore

Yuchen Hu
Nanyang Technological University, Singapore

Jixun Yao
Northwestern Polytechnical University, China

Sabato Marco Siniscalchi
University of Palermo, Italy

Eng Siong Chng
Nanyang Technological University, Singapore