Multi-Metric Preference Alignment for Generative Speech Restoration

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
In generative speech restoration, misalignment between training objectives and human perceptual preferences often limits reconstruction quality. To address this, we propose a multi-metric preference alignment strategy and introduce GenSR-Pref, the first large-scale preference dataset for speech restoration (80K preference pairs). Each preference pair is judged jointly on perceptual quality, signal fidelity, content consistency, and timbre preservation, mitigating the reward hacking induced by single-metric optimization. We further pioneer preference-based post-training, specifically Direct Preference Optimization (DPO), for generative speech restoration, applying it across autoregressive (AR), masked generative modeling (MGM), and flow-matching (FM) architectures. The aligned models can then serve as "data annotators", generating high-quality pseudo-labels that alleviate data scarcity when training discriminative models. Experiments show consistent and significant improvements in both objective and subjective metrics on mainstream benchmarks, and ablation studies validate the effectiveness of the multi-metric design.
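
For context, a minimal sketch of the standard DPO objective the paper builds on (tensor names and the beta value are illustrative; the paper's adaptation of the loss to AR, MGM, and FM decoders is not shown here):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective over a batch of (chosen, rejected) restorations.

    Inputs are per-sample sequence log-probabilities under the trainable
    policy and a frozen reference model; beta scales the implicit reward.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Widen the margin between preferred and dispreferred outputs
    # without letting the policy drift far from the reference.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```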

📝 Abstract
Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post-training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under-explored. This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. We construct a new dataset, GenSR-Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow-matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi-metric strategy over single-metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful "data annotators", generating high-quality pseudo-labels to serve as a supervision signal for traditional discriminative models in data-scarce scenarios like singing voice restoration. Demo Page: https://gensr-pref.github.io
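
The abstract's "unanimously favored" criterion suggests a simple filter over candidate restorations. A hedged sketch of how such curation could look (metric names, the score layout, and the strict-inequality rule are assumptions; the paper's exact scoring pipeline may differ):

```python
from typing import Dict, Optional, Tuple

# Hypothetical metric axes mirroring the four criteria named in the paper.
METRICS = ("perceptual_quality", "signal_fidelity",
           "content_consistency", "timbre_preservation")

def unanimous_pair(cand_a: str, cand_b: str,
                   scores: Dict[str, Dict[str, float]]
                   ) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) only when one candidate wins on every metric.

    scores[metric][candidate] holds that candidate's score (higher assumed
    better). Split verdicts are dropped, so no single metric can be gamed.
    """
    a_sweeps = all(scores[m][cand_a] > scores[m][cand_b] for m in METRICS)
    b_sweeps = all(scores[m][cand_b] > scores[m][cand_a] for m in METRICS)
    if a_sweeps:
        return cand_a, cand_b
    if b_sweeps:
        return cand_b, cand_a
    return None  # metrics disagree: discard the pair
```
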
Problem

Research questions and friction points this paper is trying to address.

Aligning generative speech models with human perceptual preferences
Defining robust preference signals to avoid reward hacking
Creating high-quality datasets for multi-metric preference alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-metric preference alignment strategy for generative speech restoration
Direct Preference Optimization (DPO) on the new GenSR-Pref dataset (80K pairs), applied across AR, MGM, and FM paradigms
Holistic, complementary metrics that mitigate single-metric reward hacking
Aligned models reused as "data annotators", producing pseudo-labels for discriminative training in data-scarce settings