AudioFuse: Unified Spectral-Temporal Learning via a Hybrid ViT-1D CNN Architecture for Robust Phonocardiogram Classification

📅 2025-09-27
🤖 AI Summary
To address the loss of phase information and the limited temporal resolution of 2D spectrograms in phonocardiogram (PCG) classification, this paper proposes AudioFuse, a hybrid ViT-1D CNN architecture that jointly exploits spectral and time-domain features. The model takes a dual-path input, 2D spectrograms and raw 1D waveforms, processed in parallel by a wide-but-shallow Vision Transformer (ViT) and a shallow 1D CNN, respectively; the high-level branch features are adaptively fused, enabling unified end-to-end training from scratch without pre-training. Evaluated on PhysioNet 2016, the method achieves a competitive, state-of-the-art ROC-AUC of 0.8608, outperforming both single-representation baselines. It also demonstrates strong robustness to domain shift, attaining an ROC-AUC of 0.7181 on the cross-domain PASCAL dataset, substantially outperforming prior approaches. The architecture thus offers a principled, pre-training-free route to robust, generalizable PCG classification.
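The dual-path design described above can be sketched roughly as follows. Everything here is an illustrative assumption, not the paper's actual configuration: the layer shapes, the random stand-in weights, and the softmax-weighted fusion are placeholders for the trained wide-but-shallow ViT, shallow 1D CNN, and adaptive fusion module.

```python
import numpy as np

rng = np.random.default_rng(0)

def vit_branch(spectrogram):
    """Stand-in for the wide-but-shallow ViT: flatten fake 'patches',
    project with a random matrix, mean-pool. Purely illustrative."""
    patches = spectrogram.reshape(-1, 16)          # (n_patches, 16)
    w = rng.standard_normal((16, 64)) * 0.1        # untrained projection
    return np.tanh(patches @ w).mean(axis=0)       # (64,) feature vector

def cnn_branch(waveform):
    """Stand-in for the shallow 1D CNN: one random conv + truncation
    to a fixed-size feature vector."""
    kernel = rng.standard_normal(9) * 0.1
    conv = np.convolve(waveform, kernel, mode="valid")
    return np.tanh(conv[:64])                      # crude (64,) feature vector

def fuse(spec_feat, wave_feat):
    """Hypothetical adaptive fusion: softmax-weighted sum of the
    two branch features (the paper's fusion mechanism may differ)."""
    logits = np.array([spec_feat.mean(), wave_feat.mean()])
    a = np.exp(logits) / np.exp(logits).sum()      # per-branch weights
    return a[0] * spec_feat + a[1] * wave_feat

spec = rng.standard_normal((32, 16))   # toy "spectrogram"
wave = rng.standard_normal(1024)       # toy raw 1D waveform
fused = fuse(vit_branch(spec), cnn_branch(wave))
print(fused.shape)  # (64,)
```

The point of the sketch is the data flow: both representations of the same recording are encoded independently, and a learned weighting decides how much each branch contributes before classification.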

📝 Abstract
Biomedical audio signals, such as phonocardiograms (PCG), are inherently rhythmic and contain diagnostic information in both their spectral (tonal) and temporal domains. Standard 2D spectrograms provide rich spectral features but compromise the phase information and temporal precision of the 1D waveform. We propose AudioFuse, an architecture that simultaneously learns from both complementary representations to classify PCGs. To mitigate the overfitting risk common in fusion models, we integrate a custom, wide-and-shallow Vision Transformer (ViT) for spectrograms with a shallow 1D CNN for raw waveforms. On the PhysioNet 2016 dataset, AudioFuse achieves a competitive, state-of-the-art ROC-AUC of 0.8608 when trained from scratch, outperforming its spectrogram (0.8066) and waveform (0.8223) baselines. Moreover, it demonstrates superior robustness to domain shift on the challenging PASCAL dataset, maintaining an ROC-AUC of 0.7181 while the spectrogram baseline collapses (0.4873). Fusing complementary representations thus provides a strong inductive bias, enabling the creation of efficient, generalizable classifiers without requiring large-scale pre-training.
Problem

Research questions and friction points this paper is trying to address.

Classifying phonocardiograms using spectral and temporal data
Overcoming limitations of 2D spectrograms and 1D waveforms
Improving robustness against domain shift in biomedical audio
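The "loss of phase information" friction point above can be seen in a few lines: two clearly different waveforms can share an identical magnitude spectrum, so a magnitude-only spectrogram cannot tell them apart. The toy signal and shift amount below are arbitrary choices for illustration.

```python
import numpy as np

# Two different waveforms: a two-tone signal and a circularly shifted copy.
t = np.arange(256)
x = np.sin(2 * np.pi * 5 * t / 256) + 0.5 * np.sin(2 * np.pi * 13 * t / 256)
y = np.roll(x, 40)  # same spectral content, different timing (phase)

# A circular shift multiplies each DFT bin by a unit-magnitude phase
# factor, so the magnitude spectra are identical.
mag_x = np.abs(np.fft.rfft(x))
mag_y = np.abs(np.fft.rfft(y))

print(np.allclose(mag_x, mag_y))  # True: magnitudes indistinguishable
print(np.allclose(x, y))          # False: the waveforms differ
```

This is why the raw 1D waveform branch is complementary rather than redundant: it retains exactly the timing information that the magnitude spectrogram discards.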
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid ViT-1D CNN architecture fuses spectral-temporal learning
Wide-and-shallow ViT for spectrograms paired with a shallow 1D CNN for raw waveforms
Fusion of high-level branch features improves robustness and mitigates overfitting
Md. Saiful Bari Siddiqui
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Utsab Saha
Lecturer, Dept. of CSE, BRAC University | Student, M.Sc. Engg. in EEE, BUET
Signal Processing · AI in Healthcare · Deep Learning · Differential Privacy