🤖 AI Summary
Systematic comparative evaluation of STFT spectrograms versus wavelet scalograms as CNN inputs for acoustic recognition remains lacking. Method: This work conducts the first fair, comprehensive performance attribution analysis under a unified CNN architecture, rigorously assessing noise robustness, time-frequency resolution, and task-specific adaptability. Experiments employ a multi-SNR acoustic dataset with standardized preprocessing (Mel filtering) and feature extraction (STFT and continuous wavelet transform, CWT). Contribution/Results: Spectrograms achieve 2.1% higher accuracy on stationary speech recognition, whereas scalograms yield a 5.7% improvement in F1-score for transient/non-stationary sound classification (e.g., knocks, alarms). The study elucidates fundamental representational differences between these time-frequency representations and provides an empirically grounded, scenario-aware decision-making framework for selecting optimal features in acoustic recognition tasks.
📝 Abstract
Acoustic recognition has emerged as a prominent task in deep learning research, frequently relying on spectral feature extraction techniques such as the spectrogram from the Short-Time Fourier Transform (STFT) and the scalogram from the Wavelet Transform. However, few studies comprehensively compare the advantages, drawbacks, and performance of these two representations. This paper evaluates the characteristics of both transforms as input data for acoustic recognition with Convolutional Neural Networks, documenting the performance of models trained on each for comparison. Through this analysis, the paper elucidates the advantages and limitations of each method, offers insight into their respective application scenarios, and identifies directions for further research.
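To make the two representations concrete, here is a minimal NumPy-only sketch (not the paper's exact pipeline) of an STFT magnitude spectrogram and a continuous-wavelet scalogram. The Hann window, FFT length, hop size, Ricker (Mexican hat) wavelet, and scale range are all illustrative assumptions; the paper itself uses Mel filtering and its own CWT settings.

```python
import numpy as np

def stft_spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> |rFFT| per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Rows = frequency bins, columns = time frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T

def ricker(points, a):
    """Ricker (Mexican hat) wavelet of width `a`, sampled at `points` samples."""
    t = np.arange(points) - (points - 1) / 2
    amp = 2 / (np.sqrt(3 * a) * np.pi ** 0.25)
    return amp * (1 - (t / a) ** 2) * np.exp(-(t ** 2) / (2 * a ** 2))

def cwt_scalogram(x, scales):
    """Scalogram: |CWT| of x via direct convolution, one row per scale."""
    return np.abs(np.stack([
        np.convolve(x, ricker(min(10 * s, len(x)), s), mode="same")
        for s in scales
    ]))

# Toy signal matching the paper's scenarios: a stationary tone
# (favoring the spectrogram) plus a brief transient "knock"
# (the non-stationary case where the scalogram excels).
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
x[4000:4040] += 2.0  # short broadband transient

spec = stft_spectrogram(x)               # shape: (129, 61)
scal = cwt_scalogram(x, range(1, 31))    # shape: (30, 8000)
```

Either array can then be treated as a single-channel image and fed to a CNN; the spectrogram has fixed time-frequency resolution set by `n_fft`, while the scalogram trades time for frequency resolution across scales, which is the representational difference the paper attributes performance to.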