🤖 AI Summary
This work addresses the low modeling efficiency and unclear fusion mechanisms involved in jointly leveraging handcrafted features (e.g., MFCCs) and pretrained model (PTM) features such as wav2vec 2.0 and HuBERT for voice activity detection (VAD). We propose FusionVAD, a lightweight unified framework. Methodologically, we systematically compare six PTM feature variants against MFCCs and empirically demonstrate, for the first time, that simple additive fusion substantially outperforms more complex mechanisms such as cross-attention in both accuracy and computational efficiency, revealing strong complementarity between handcrafted and PTM features in time-frequency representation learning. Evaluated across multiple benchmark datasets, FusionVAD surpasses the state-of-the-art Pyannote by 2.04% in average absolute performance with 3.2× faster inference, while using fewer parameters. The framework also exhibits significantly improved robustness and practical deployability.
📝 Abstract
Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing handcrafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
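To make the two simplest fusion strategies concrete, here is a minimal NumPy sketch of additive and concatenation fusion of frame-level MFCC and PTM features. All dimensions, projection matrices, and variable names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level features for 100 frames (dimensions assumed):
mfcc = rng.standard_normal((100, 40))    # handcrafted MFCC features
ptm = rng.standard_normal((100, 768))    # PTM features (e.g., HuBERT-style)

d_model = 256  # assumed shared fusion width

# Linear projections bring both streams to the same width.
W_mfcc = rng.standard_normal((40, d_model)) * 0.05
W_ptm = rng.standard_normal((768, d_model)) * 0.05

# Additive fusion: element-wise sum of the projected streams.
fused_add = mfcc @ W_mfcc + ptm @ W_ptm                           # shape (100, 256)

# Concatenation fusion, for comparison: stack along the feature axis.
fused_cat = np.concatenate([mfcc @ W_mfcc, ptm @ W_ptm], axis=1)  # shape (100, 512)

print(fused_add.shape, fused_cat.shape)  # → (100, 256) (100, 512)
```

Note that addition keeps the fused representation at the shared width `d_model`, whereas concatenation doubles it, which is one reason additive fusion can be cheaper for the downstream VAD classifier.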