🤖 AI Summary
This work addresses the low modeling efficiency and unclear fusion mechanisms involved in jointly leveraging handcrafted features (e.g., MFCCs) and pretrained model (PTM) features such as wav2vec 2.0 and HuBERT for voice activity detection (VAD). We propose FusionVAD, a lightweight unified framework. Methodologically, we systematically compare six PTM feature variants against MFCCs and empirically demonstrate, for the first time, that simple additive fusion substantially outperforms more complex mechanisms such as cross-attention in both accuracy and computational efficiency, revealing strong complementarity between handcrafted and PTM features in time-frequency representation learning. Evaluated across multiple benchmark datasets, FusionVAD surpasses the state-of-the-art Pyannote by 2.04% in average absolute performance with 3.2× faster inference, while using fewer parameters. The framework also exhibits significantly improved robustness and practical deployability.
📝 Abstract
Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing handcrafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
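To make the two simplest fusion strategies concrete, here is a minimal NumPy sketch of additive and concatenation fusion of frame-level MFCC and PTM features. All dimensions, projection matrices, and variable names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level features for 100 frames (dimensions assumed):
mfcc = rng.standard_normal((100, 40))    # handcrafted MFCC features
ptm = rng.standard_normal((100, 768))    # PTM features (e.g., HuBERT-style)

d_model = 256  # assumed shared fusion width

# Linear projections bring both streams to the same width.
W_mfcc = rng.standard_normal((40, d_model)) * 0.05
W_ptm = rng.standard_normal((768, d_model)) * 0.05

# Additive fusion: element-wise sum of the projected streams.
fused_add = mfcc @ W_mfcc + ptm @ W_ptm                           # shape (100, 256)

# Concatenation fusion, for comparison: stack along the feature axis.
fused_cat = np.concatenate([mfcc @ W_mfcc, ptm @ W_ptm], axis=1)  # shape (100, 512)

print(fused_add.shape, fused_cat.shape)  # → (100, 256) (100, 512)
```

Note that addition keeps the fused representation at the shared width `d_model`, whereas concatenation doubles it, which is one reason additive fusion can be cheaper for the downstream VAD classifier.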