🤖 AI Summary
Dialectal variation significantly degrades cross-regional automatic bird call recognition in passive acoustic monitoring. To address this, we propose a dialect-robust bird call recognition framework with three components: (1) a TDNN-based architecture that combines frequency-sensitive normalization (IFN and gated Relaxed-IFN) with gradient-reversal adversarial training to learn region-invariant acoustic representations; (2) multi-level data augmentation, including Mixup for rare classes and CycleGAN-based style transfer; and (3) a Dialect-Calibrated Augmentation (DCA) mechanism that softly down-weights the synthetic samples to suppress generation artifacts. Evaluated on the DB3V dataset, our method improves cross-dialect recognition accuracy by up to 20 percentage points over baseline TDNNs without compromising intra-regional performance. Grad-CAM and LIME visualizations show that the robust models attend to stable, ecologically meaningful harmonic frequency bands, which supports both their robustness and their interpretability.
📝 Abstract
Dialect variation hampers automatic recognition of bird calls collected by passive acoustic monitoring. We address the problem on DB3V, a three-region, ten-species corpus of 8-s clips, and propose a deployable framework built on Time-Delay Neural Networks (TDNNs). Frequency-sensitive normalisation (Instance Frequency Normalisation and a gated Relaxed-IFN) is paired with gradient-reversal adversarial training to learn region-invariant embeddings. A multi-level augmentation scheme combines waveform perturbations, Mixup for rare classes, and CycleGAN style transfer that synthesises audio in the style of Region 2 (Interior Plains); Dialect-Calibrated Augmentation (DCA) softly down-weights the synthetic samples to limit artifacts. The complete system lifts cross-dialect accuracy by up to twenty percentage points over baseline TDNNs while preserving in-region performance. Grad-CAM and LIME analyses show that robust models concentrate on stable harmonic bands, providing ecologically meaningful explanations. The study demonstrates that lightweight, transparent, and dialect-resilient bird-sound recognition is attainable.
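
The soft down-weighting of synthetic samples can be illustrated with a per-sample weighted cross-entropy. This is a minimal sketch of one plausible reading of DCA, not the authors' implementation: the function names, the `is_synthetic` flag, and the `synthetic_weight` value of 0.5 are all assumptions made for illustration, and the abstract does not say how the weights are chosen.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, is_synthetic, synthetic_weight=0.5):
    """Weighted mean cross-entropy in which synthetic clips count for less.

    Hypothetical illustration: real clips get weight 1.0, synthetic
    (e.g. CycleGAN-generated) clips get `synthetic_weight` in [0, 1].
    """
    if not 0.0 <= synthetic_weight <= 1.0:
        raise ValueError("synthetic_weight must lie in [0, 1]")

    weights = [synthetic_weight if flag else 1.0 for flag in is_synthetic]
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("batch has no sample with non-zero weight")

    # Per-sample negative log-likelihood of the true class.
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(batch_logits, labels)]

    # Normalise by the total weight so the loss scale stays comparable
    # across batches with different shares of synthetic clips.
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# Usage: one well-classified real clip and one misclassified synthetic clip.
example_logits = [[2.0, 0.0], [0.0, 2.0]]
example_labels = [0, 0]
example_flags = [False, True]
example_loss = weighted_cross_entropy(example_logits, example_labels, example_flags)
```

With `synthetic_weight=1.0` this reduces to the ordinary mean cross-entropy, and with `synthetic_weight=0.0` the synthetic clips are ignored entirely. Intermediate values let generated audio contribute to training while limiting the influence of any artifacts it carries.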