Aligning Generative Speech Enhancement with Human Preferences via Direct Preference Optimization

📅 2025-07-14
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study addresses the misalignment between conventional optimization objectives and human auditory preferences in generative speech enhancement. We propose a preference-aligned method built on language-model architectures, introducing Direct Preference Optimization (DPO) to speech enhancement for the first time. To enable scalable, annotation-free preference supervision, we adopt UTMOS, a neural MOS predictor, as a proxy for human perceptual feedback. Unlike standard mean-squared-error or likelihood-based training paradigms, our approach directly optimizes for human subjective quality preferences. Experiments on the DNS2020 test sets demonstrate that the proposed method achieves up to a 56% relative improvement over baseline models on objective metrics, including PESQ, STOI, and ViSQOL, while markedly enhancing speech naturalness and intelligibility. These results validate the effectiveness and generalizability of perception-driven optimization in generative speech enhancement.
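For readers unfamiliar with DPO, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that the summary refers to, assuming each argument is the summed token log-probability of an enhanced-speech sequence under the trainable policy or a frozen reference model. The function name and the beta value are illustrative, not the paper's exact implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the trainable policy vs. the frozen reference
    # for the preferred ("chosen") and dispreferred ("rejected") outputs.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the preferred output's log-ratio above the dispreferred one's;
    # beta (illustrative value) controls how far the policy may drift
    # from the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```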

📝 Abstract
This work investigates speech enhancement (SE) from the perspective of language models (LMs). We propose a novel method that leverages Direct Preference Optimization (DPO) to improve the perceptual quality of enhanced speech. Using UTMOS, a neural MOS prediction model, as a proxy for human ratings, our approach guides optimization toward perceptually preferred outputs. This differs from existing LM-based SE methods that focus on maximizing the likelihood of clean speech tokens, which may misalign with human perception and degrade quality despite low prediction error. Experiments on the 2020 Deep Noise Suppression Challenge test sets demonstrate that applying DPO to a pretrained LM-based SE model yields consistent improvements across various speech quality metrics, with relative gains of up to 56%. To our knowledge, this is the first application of DPO to SE and the first to incorporate proxy perceptual feedback into LM-based SE training, pointing to a promising direction for perceptually aligned SE.
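The abstract's key mechanism, using UTMOS as an annotation-free proxy for human ratings, can be sketched as follows. Here `se_model.generate` and `utmos.score` are hypothetical interfaces standing in for the LM-based SE model's sampler and a MOS-prediction model; the paper does not specify the number of candidates or the decoding settings, so these are assumptions.

```python
import torch

@torch.no_grad()
def build_preference_pair(se_model, utmos, noisy_tokens, n_candidates=4):
    # Sample several candidate enhancements for the same noisy input
    # (hypothetical sampler interface; decoding settings are assumptions).
    candidates = [se_model.generate(noisy_tokens, do_sample=True)
                  for _ in range(n_candidates)]
    # Score each candidate with the UTMOS proxy for human MOS ratings.
    scores = [utmos.score(c) for c in candidates]
    # Highest-scoring candidate is "chosen", lowest is "rejected".
    chosen = candidates[scores.index(max(scores))]
    rejected = candidates[scores.index(min(scores))]
    return chosen, rejected
```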
Problem

Research questions and friction points this paper is trying to address.

Aligns speech enhancement with human perceptual preferences
Uses DPO to improve quality over likelihood-based LM methods
Incorporates proxy feedback (UTMOS) for perceptual optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Direct Preference Optimization for speech enhancement
Leverages UTMOS as proxy for human perceptual ratings
Applies DPO to a pretrained LM-based SE model (a combined sketch follows this list)
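Putting the two earlier sketches together, a single DPO fine-tuning step over a pretrained LM-based SE policy might look like the following. `sequence_logprob` is a placeholder for scoring a token sequence under a model, and the optimizer usage reflects generic PyTorch practice rather than the paper's reported settings.

```python
import torch

def dpo_step(policy, reference, optimizer,
             noisy_tokens, chosen, rejected, beta=0.1):
    # Sequence log-probabilities under the trainable policy
    # (placeholder method; not the paper's actual API).
    pol_c = policy.sequence_logprob(noisy_tokens, chosen)
    pol_r = policy.sequence_logprob(noisy_tokens, rejected)
    # The reference model stays frozen throughout fine-tuning.
    with torch.no_grad():
        ref_c = reference.sequence_logprob(noisy_tokens, chosen)
        ref_r = reference.sequence_logprob(noisy_tokens, rejected)
    loss = dpo_loss(pol_c, pol_r, ref_c, ref_r, beta)  # from the earlier sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```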
👥 Authors

Haoyang Li
Nanyang Technological University, Singapore

Nana Hou
ZOOM | Ph.D. at Nanyang Technological University, Singapore

Yuchen Hu
Nanyang Technological University, Singapore

Jixun Yao
Northwestern Polytechnical University, China

Sabato Marco Siniscalchi
University of Palermo, Italy

Eng Siong Chng
Nanyang Technological University, Singapore