🤖 AI Summary
To address the poor robustness of voice activity detection (VAD) under noisy and resource-constrained conditions, and the misalignment between conventional classification losses and evaluation metrics such as AUROC, this paper proposes SincQDR-VAD, a compact and efficient end-to-end VAD framework. Methodologically: (i) learnable Sinc bandpass filters form a noise-robust spectral front-end that improves feature discriminability; (ii) a novel quadratic disparity ranking loss explicitly optimizes the relative ranking of speech versus non-speech frames, so that training is aligned more closely with AUROC. Experiments on multiple benchmark datasets show consistent improvements, with AUROC increasing by 1.2–2.8% and F2-score by 3.5–5.1%, while the model requires only 69% of the parameters of prior methods. The proposed approach thus combines improved accuracy with high parameter efficiency.
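To make the link between pairwise ranking and AUROC concrete, here is a minimal NumPy sketch. AUROC equals the fraction of (speech, non-speech) frame pairs in which the speech frame receives the higher score, so a loss that penalizes badly ordered pairs acts as a smooth surrogate for it. The squared-hinge form and the `margin` parameter below are assumptions chosen for illustration; the paper's exact loss may differ.

```python
import numpy as np

def _pairwise_differences(scores, labels):
    """Score differences s_speech - s_nonspeech for every cross-class pair."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    return pos[:, None] - neg[None, :]

def pairwise_auroc(scores, labels):
    """AUROC as the fraction of correctly ordered pairs (ties count as 0.5)."""
    diff = _pairwise_differences(scores, labels)
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def quadratic_ranking_loss(scores, labels, margin=1.0):
    """Illustrative squared-hinge ranking loss (assumed form, not the paper's).

    A pair contributes zero once the speech score exceeds the non-speech
    score by at least `margin`; otherwise the shortfall is squared.
    """
    diff = _pairwise_differences(scores, labels)
    return float(np.mean(np.maximum(0.0, margin - diff) ** 2))

if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    good = [0.9, 0.8, 0.3, 0.1]   # every speech frame outranks every non-speech frame
    bad = [0.2, 0.4, 0.6, 0.8]    # every pair is misordered
    for name, s in (("good", good), ("bad", bad)):
        print(name, pairwise_auroc(s, labels), quadratic_ranking_loss(s, labels, margin=0.2))
```

Unlike a frame-wise cross-entropy, this kind of loss depends only on score differences between classes, which is the quantity AUROC measures.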
📝 Abstract
Voice activity detection (VAD) is essential for speech-driven applications, but remains far from perfect in noisy and resource-limited environments. Existing methods often lack robustness to noise, and their frame-wise classification losses are only loosely coupled with the metrics used to evaluate VAD. To address these challenges, we propose SincQDR-VAD, a compact and robust framework that combines a Sinc-extractor front-end with a novel quadratic disparity ranking loss. The Sinc-extractor uses learnable bandpass filters to capture noise-resistant spectral features, while the ranking loss optimizes the pairwise score order between speech and non-speech frames to improve the area under the receiver operating characteristic curve (AUROC). Experiments on representative benchmark datasets show that our framework considerably improves both AUROC and F2-Score while using only 69% of the parameters of prior art, confirming its efficiency and practical viability.
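As an illustration of what a Sinc bandpass filter looks like, the sketch below builds one windowed-sinc kernel in NumPy as the difference of two ideal low-pass filters. In a learnable front-end, the two cutoff frequencies would be trainable parameters of each filter; here they are fixed numbers. The function names, the 16 kHz sample rate, the kernel length, and the Hamming window are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def sinc_bandpass(f_low_hz, f_high_hz, sample_rate=16000, kernel_size=257):
    """Windowed-sinc bandpass kernel: low-pass(f_high) minus low-pass(f_low)."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2   # taps centred on zero
    f1 = f_low_hz / sample_rate                           # cycles per sample
    f2 = f_high_hz / sample_rate
    # np.sinc(x) is sin(pi x) / (pi x), so 2 f sinc(2 f n) is an ideal low-pass.
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return h * np.hamming(kernel_size)                    # taper to reduce ripple

def gain_at(h, freq_hz, sample_rate=16000):
    """Magnitude response of kernel h at a single frequency."""
    n = np.arange(len(h))
    return float(abs(np.sum(h * np.exp(-2j * np.pi * freq_hz / sample_rate * n))))

if __name__ == "__main__":
    h = sinc_bandpass(300.0, 3400.0)      # a telephone-like speech band
    noisy = np.random.default_rng(0).standard_normal(16000)
    filtered = np.convolve(noisy, h, mode="same")
    for f in (50, 1500, 6000):
        print(f, "Hz gain:", round(gain_at(h, f), 4))
```

Because each filter is described by only two numbers (its lower and upper cutoff), a bank of such filters needs far fewer parameters than a bank of unconstrained convolution kernels of the same length, which fits the compactness goal stated in the abstract.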