Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low modeling efficiency and unclear fusion mechanisms involved in jointly leveraging handcrafted features (e.g., MFCCs) and pretrained model (PTM) features, such as wav2vec 2.0 and HuBERT, for voice activity detection (VAD). The authors propose FusionVAD, a lightweight unified framework. Methodologically, they systematically compare six PTM feature variants against MFCCs and demonstrate, for the first time, that simple additive fusion substantially outperforms more complex mechanisms such as cross-attention in both accuracy and computational efficiency, revealing strong complementarity between handcrafted and PTM features in time-frequency representation learning. Evaluated across multiple benchmark datasets, FusionVAD surpasses the state-of-the-art Pyannote by 2.04% in average absolute performance with 3.2× faster inference while using fewer parameters, and also exhibits significantly improved robustness and practical deployability.

📝 Abstract
Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
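The three fusion strategies named in the abstract can be sketched in a few lines. The dimensions, random projection weights, and the single-head dot-product attention below are illustrative assumptions standing in for the paper's learned components, not the actual FusionVAD implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_mfcc, d_ptm, d = 50, 39, 768, 256   # frames, MFCC dim, PTM dim, shared dim

mfcc = rng.standard_normal((T, d_mfcc))  # handcrafted features per frame
ptm = rng.standard_normal((T, d_ptm))    # pretrained-model features per frame

# Linear projections to a shared dimension (random weights stand in for learned ones).
W_m = rng.standard_normal((d_mfcc, d)) / np.sqrt(d_mfcc)
W_p = rng.standard_normal((d_ptm, d)) / np.sqrt(d_ptm)
h_m, h_p = mfcc @ W_m, ptm @ W_p

# 1) Addition: element-wise sum of the projected features.
fused_add = h_m + h_p                              # shape (T, d)

# 2) Concatenation: stack along the feature axis.
fused_cat = np.concatenate([h_m, h_p], axis=-1)    # shape (T, 2d)

# 3) Cross-attention: MFCC frames attend over PTM frames.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

attn = softmax(h_m @ h_p.T / np.sqrt(d))           # (T, T) attention weights
fused_ca = attn @ h_p                              # shape (T, d)
```

Each fused representation would then feed a frame-level speech/non-speech classifier; the paper's finding is that the cheapest option (addition) is also the most accurate.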
Problem

Research questions and friction points this paper is trying to address.

How effective are MFCCs versus PTM features for voice activity detection?
Can a unified framework (FusionVAD) combine both feature types via simple fusion strategies?
Do fusion-based models outperform single-feature models and state-of-the-art methods?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines MFCCs and PTM features in a single lightweight framework for VAD
Shows that simple fusion strategies, especially addition, beat cross-attention in accuracy and efficiency
Outperforms the state-of-the-art Pyannote by 2.04% absolute on average
Kumud Tripathi
Media Analysis Group, Sony Research India
Chowdam Venkata Kumar
Media Analysis Group, Sony Research India
Pankaj Wasnik
Sony Research
Computer Vision · Biometrics · Machine Translation · Speech Generation