🤖 AI Summary
This study addresses a key challenge in neural target speech extraction (TSE): accurately modeling fine-grained speech distortions. We propose the first human-feedback-driven iterative optimization framework: users annotate only erroneous speech segments, and the system generates an editing mask to perform localized refinement while freezing unannotated regions to preserve semantic integrity and audio fidelity. To avoid costly real-world human annotation, we design an automatic mask generation strategy based on noise power (dBFS) and probability thresholds, and simulate annotation errors using synthetically corrupted data for efficient weakly supervised training. Experiments demonstrate strong agreement between predicted and manual masks (IoU = 0.82); a user study with 22 participants shows a statistically significant preference for refined outputs (p < 0.01); and objective evaluations confirm substantial improvements in speech intelligibility and target speaker fidelity.
📝 Abstract
We present the first neural target speech extraction (TSE) system that uses human feedback for iterative refinement. Our approach lets users mark erroneous segments of the TSE output, from which an edit mask is generated. The refinement system then improves the marked sections while preserving unmarked regions. Since large-scale datasets of human-marked errors are difficult to collect, we generate synthetic datasets using various automated masking functions and train a model on each. Evaluations show that models trained with noise power-based masking (in dBFS) and probabilistic thresholding perform best and align well with human annotations. In a study with 22 participants, users showed a statistically significant preference for refined outputs over baseline TSE. Our findings demonstrate that human-in-the-loop refinement is a promising approach for improving neural speech extraction.
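The automatic masking described above can be illustrated with a small sketch. The paper does not publish its exact procedure, so everything here is an assumption for illustration: we treat the residual (TSE output minus clean reference) as the error signal, mark frames whose level in dBFS exceeds a threshold, and keep each candidate frame with some probability to mimic imperfect human annotation. All function names, frame sizes, and thresholds are hypothetical.

```python
import numpy as np

def dbfs(frame, eps=1e-10):
    """Frame level in dBFS, assuming float audio with full scale at 1.0."""
    rms = np.sqrt(np.mean(frame ** 2))
    return 20 * np.log10(max(rms, eps))

def generate_edit_mask(residual, frame_len=400, hop=160,
                       dbfs_threshold=-35.0, keep_prob=1.0, rng=None):
    """Sketch of dBFS + probability-threshold mask generation (assumed form).

    Frames of the residual louder than `dbfs_threshold` are candidate
    error regions; each candidate is kept with probability `keep_prob`
    to simulate annotation noise for weakly supervised training.
    Returns a boolean sample-level mask (True = region to refine).
    """
    rng = rng or np.random.default_rng(0)
    n_frames = 1 + max(0, (len(residual) - frame_len) // hop)
    mask = np.zeros(len(residual), dtype=bool)
    for i in range(n_frames):
        start = i * hop
        frame = residual[start:start + frame_len]
        if dbfs(frame) > dbfs_threshold and rng.random() < keep_prob:
            mask[start:start + frame_len] = True
    return mask
```

With `keep_prob < 1.0`, the same signal yields slightly different masks across draws, which is one plausible way to expose the refinement model to annotation variability during training.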