🤖 AI Summary
This work investigates the mechanism and role of frequency-dependent modeling in sound event detection (SED). Focusing on FilterAugment and frequency-dynamic convolution (FDY Conv), we conduct systematic analyses—including class-wise performance evaluation, Grad-CAM visualization, frequency-domain PCA representation, and ablation studies—to characterize their intrinsic properties along three dimensions: frequency-domain interpretability, component-wise contribution, and class-sensitive adaptivity. We propose a lightweight variant of frequency-dependent convolution. Empirical results show that FDY Conv’s dynamic kernels exhibit strong class specificity, eliciting discriminative frequency responses per class. In contrast, while FilterAugment improves overall generalization, it degrades recognition accuracy for rare classes. Our findings confirm that explicit frequency-dependency modeling is critical for SED performance, and that effective designs must jointly optimize class-balanced learning and computational efficiency.
📝 Abstract
In this work, we apply several analysis methods to frequency-dependent methods for sound event detection (SED) to examine their characteristics and behavior in detail. While SED has advanced rapidly by adopting deep learning techniques from other pattern recognition fields, these techniques are often not well suited to SED. To address this issue, two frequency-dependent SED methods were previously proposed: FilterAugment, a data augmentation method that randomly weights frequency bands, and frequency dynamic convolution (FDY Conv), an architecture that applies frequency-adaptive convolution kernels. These methods have demonstrated superior performance in SED, and we aim to further analyze their effectiveness and characteristics. We compare class-wise performance to identify the specific strengths and weaknesses of FilterAugment and FDY Conv. We apply Gradient-weighted Class Activation Mapping (Grad-CAM), which highlights the time-frequency regions the model relies on most for its predictions, to SED models with and without frequency masking and with two types of FilterAugment in order to observe their detailed characteristics. We propose simpler frequency-dependent convolution methods and compare them with FDY Conv to understand which components of FDY Conv affect SED performance. Lastly, we apply principal component analysis (PCA) to show how FDY Conv adapts its dynamic kernels across the frequency dimension for different sound event classes. The results and discussion demonstrate that frequency dependency plays a significant role in SED and further confirm the effectiveness of frequency-dependent methods for SED.
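To make the idea of "randomly weighting frequency bands" concrete, the following is a minimal sketch of a step-style band-weighting augmentation. It is not the authors' implementation: the function name `filter_augment_step`, the parameters `n_bands` and `db_range`, their default values, and the use of a linear-amplitude spectrogram stored as nested lists are all assumptions made for illustration.

```python
import random


def filter_augment_step(spec, n_bands=(2, 5), db_range=(-6.0, 6.0), rng=None):
    """Scale each random frequency band of spec[freq][time] by one random gain.

    Hypothetical sketch: parameter names and defaults are illustrative only.
    Returns a new nested list; the input is left unchanged.
    """
    rng = rng or random.Random()
    n_freq = len(spec)
    # Pick the number of bands, capped so every band holds at least one bin.
    k = min(rng.randint(n_bands[0], n_bands[1]), n_freq)
    # k - 1 distinct interior boundaries split the frequency axis into k bands.
    cuts = sorted(rng.sample(range(1, n_freq), k - 1))
    edges = [0] + cuts + [n_freq]
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        gain_db = rng.uniform(db_range[0], db_range[1])
        # Assumes amplitude values, hence the divisor of 20 when leaving dB.
        gain = 10.0 ** (gain_db / 20.0)
        for f in range(lo, hi):
            out.append([value * gain for value in spec[f]])
    return out


# Usage: a flat 16-bin, 4-frame spectrogram makes the per-band gains visible.
flat = [[1.0] * 4 for _ in range(16)]
augmented = filter_augment_step(flat, rng=random.Random(0))
```

Because every bin within a band shares the same gain, the result is a piecewise-constant filter along frequency. A variant that interpolates gains between band boundaries would give a smoother filter; which of these corresponds to the "two types of FilterAugment" mentioned above is not stated in this abstract.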