SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization

📅 2025-08-28
🤖 AI Summary
To address the limited robustness of voice activity detection (VAD) under noisy, resource-constrained conditions, and the mismatch between conventional frame-wise classification losses and evaluation metrics such as AUROC, this paper proposes SincQDR-VAD, a compact end-to-end VAD framework. The method has two components: (i) a Sinc-extractor front-end whose learnable bandpass filters yield noise-robust spectral features; and (ii) a quadratic disparity ranking (QDR) loss that optimizes the pairwise score order between speech and non-speech frames, which targets AUROC more directly than a frame-wise classification loss. Experiments on multiple benchmark datasets show consistent improvements (AUROC increases by 1.2–2.8% and F2-score by 3.5–5.1%) while the model uses only 69% of the parameters of prior methods, combining higher accuracy with better parameter efficiency.

📝 Abstract
Voice activity detection (VAD) is essential for speech-driven applications, but remains far from perfect in noisy and resource-limited environments. Existing methods often lack robustness to noise, and their frame-wise classification losses are only loosely coupled with the evaluation metric of VAD. To address these challenges, we propose SincQDR-VAD, a compact and robust framework that combines a Sinc-extractor front-end with a novel quadratic disparity ranking loss. The Sinc-extractor uses learnable bandpass filters to capture noise-resistant spectral features, while the ranking loss optimizes the pairwise score order between speech and non-speech frames to improve the area under the receiver operating characteristic curve (AUROC). A series of experiments conducted on representative benchmark datasets shows that our framework considerably improves both AUROC and F2-Score while using only 69% of the parameters of prior art, confirming its efficiency and practical viability.
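
The idea of a bandpass filter defined by its cutoff frequencies can be illustrated with a minimal NumPy sketch. It assumes a SincNet-style parameterization (a Hamming-windowed difference of two ideal low-pass sinc filters); the abstract does not state the exact design of the Sinc-extractor, so the function names, kernel length, sample rate, and cutoffs below are hypothetical. In a learnable front-end the two cutoffs would be trainable parameters; here they are fixed numbers.

```python
import numpy as np

def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=101, sample_rate=16000):
    """Windowed-sinc band-pass FIR kernel (illustrative, not the paper's code)."""
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs normalised to cycles per sample.
    f1 = f_low_hz / sample_rate
    f2 = f_high_hz / sample_rate
    # Ideal low-pass with cutoff f has impulse response 2*f*sinc(2*f*n),
    # where np.sinc(x) = sin(pi*x) / (pi*x). A band-pass is the difference
    # of two such low-pass filters.
    band = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window truncates the infinite sinc smoothly.
    return band * np.hamming(kernel_size)

def gain_at(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at one frequency."""
    t = np.arange(len(kernel))
    return float(np.abs(np.sum(kernel * np.exp(-2j * np.pi * freq_hz / sample_rate * t))))

# Hypothetical speech band of 300-3400 Hz.
kernel = sinc_bandpass(300.0, 3400.0)
in_band = gain_at(kernel, 1500.0)   # close to 1
out_band = gain_at(kernel, 6000.0)  # close to 0
# Filtering a waveform x would then be: np.convolve(x, kernel, mode="same")
```

Because each filter is described by only two numbers rather than a full set of taps, a bank of such filters adds very few parameters, which is consistent with the compactness the abstract emphasizes.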
Problem

Research questions and friction points this paper is trying to address.

Enhancing noise robustness in voice activity detection
Aligning frame-wise training with the ranking-based evaluation metric (AUROC) through ranking-aware optimization
Reducing model parameters while maintaining performance
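
The case for ranking-aware optimization rests on a standard fact: AUROC equals the fraction of (speech, non-speech) frame pairs in which the speech frame receives the higher score, with ties counted as half. The sketch below computes it that way; the function name and the toy scores are illustrative and not taken from the paper.

```python
import numpy as np

def pairwise_auroc(scores, labels):
    """AUROC as the share of correctly ordered positive/negative pairs."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]    # scores of speech frames
    neg = scores[~labels]   # scores of non-speech frames
    # All pairwise differences, shape (num_pos, num_neg).
    diff = pos[:, None] - neg[None, :]
    # Correct order counts 1, a tie counts 0.5, wrong order counts 0.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy example: one of the four pairs (0.4 vs 0.6) is ordered wrongly.
scores = [0.9, 0.4, 0.3, 0.6]
labels = [1, 1, 0, 0]
auroc = pairwise_auroc(scores, labels)  # 3 of 4 pairs correct -> 0.75
```

Only the relative order of the scores matters here, so a loss that penalizes each frame independently can decrease without changing this quantity.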
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable bandpass filters for noise-resistant features
Quadratic disparity ranking loss for AUROC optimization
Compact framework with reduced parameter usage
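
To make the notion of a quadratic pairwise ranking loss concrete, here is a generic squared-margin surrogate over speech/non-speech score pairs. This is an assumption made for illustration: the page does not give the paper's exact quadratic disparity ranking formulation, and the function name and `margin` parameter are hypothetical.

```python
import numpy as np

def quadratic_pairwise_ranking_loss(scores, labels, margin=1.0):
    """Mean squared margin violation over all (speech, non-speech) pairs.

    Illustrative surrogate only; not the loss defined in the paper.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        return 0.0  # no pairs to rank in this batch
    # Score gap for every pair, shape (num_pos, num_neg).
    diff = pos[:, None] - neg[None, :]
    # A pair is penalised when the speech score does not exceed the
    # non-speech score by at least `margin`; the penalty grows quadratically.
    violation = np.maximum(0.0, margin - diff)
    return float(np.mean(violation ** 2))

# Toy batch: only the pair (0.5, 0.0) falls short of the margin, by 0.5.
loss = quadratic_pairwise_ranking_loss([2.0, 0.5, -1.0, 0.0], [1, 1, 0, 0])
# One violation of 0.5 over four pairs -> 0.25 / 4 = 0.0625
```

Unlike a per-frame classification loss, this objective is zero as soon as every speech frame outranks every non-speech frame by the margin, regardless of the absolute score values.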
Authors

Chien-Chun Wang
National Taiwan Normal University
Speech Enhancement, Speech Recognition, Voice Activity Detection, Speech Quality Assessment
En-Lun Yu
National Taiwan Normal University
Speech Enhancement, Personal Voice Activity Detection, Voice Activity Detection
Jeih-Weih Hung
National Chi Nan University, Taiwan
Shih-Chieh Huang
Realtek Semiconductor Corp., Taiwan
Berlin Chen
Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan