🤖 AI Summary
Most bioacoustic classification systems rely on audio models pre-trained at a 16 kHz sampling rate, so they use only the baseband (0–8 kHz) and discard the high-frequency, often ultrasonic, components common in animal vocalizations. This work proposes a multi-band encoding framework that decomposes full-spectrum animal calls into multiple frequency bands and fuses the band features into a unified representation for classification. It presents the first systematic investigation of multi-band decomposition and fusion strategies for full-spectrum bioacoustic classification, and finds that certain encoder architectures produce decorrelated band-wise embeddings that improve class separation after fusion. Across three datasets, eight pre-trained models, and five fusion strategies, the fused representations consistently outperform both the baseband-only and time-expansion baselines on two of the three datasets.
📝 Abstract
Animals hear and vocalize across frequency ranges that differ substantially from those of humans, often extending into the ultrasonic domain. Yet most computational bioacoustics systems rely on audio models pre-trained at 16 kHz, restricting their usable bandwidth to the 0–8 kHz baseband and discarding higher-frequency information present in many bioacoustic recordings. We investigate a multi-band encoding framework that decomposes the full spectrum of animal calls into band features and fuses them into a unified representation. Similarity analyses of the pre-trained models show that certain encoders produce decorrelated band embeddings that improve class separation after fusion. Classification experiments on three bioacoustic datasets using eight pre-trained models and five fusion strategies indicate that fused representations consistently outperform the baseband and time-expansion baselines on two datasets, demonstrating the potential of multi-band methods for full-spectrum encoding of animal calls.
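To make the decompose-then-fuse idea concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes 8 kHz-wide bands that are each shifted down to the baseband and represented at 16 kHz, a stand-in `toy_encoder` in place of a real pre-trained model, and plain concatenation as the fusion step. The function names, the band width, and the choice of concatenation are all illustrative assumptions; the abstract does not specify the decomposition procedure or name the five fusion strategies.

```python
import numpy as np


def split_into_bands(x, sr, band_hz=8000):
    """Split a real signal into frequency bands, each shifted to baseband.

    Assumption for illustration: every band is band_hz wide and is returned
    as a signal sampled at 2 * band_hz (16 kHz for the default), so that a
    16 kHz encoder could consume it.
    """
    n = len(x)
    spectrum = np.fft.rfft(x)
    bins_per_band = int(round(band_hz * n / sr))  # FFT bins spanning one band
    m = 2 * bins_per_band                         # samples per band signal
    n_bands = int((sr / 2) // band_hz)            # whole bands below Nyquist
    bands = []
    for k in range(n_bands):
        chunk = spectrum[k * bins_per_band:(k + 1) * bins_per_band]
        shifted = np.zeros(bins_per_band + 1, dtype=complex)
        shifted[:len(chunk)] = chunk              # move the band down to 0 Hz
        # Rescale so a tone keeps its amplitude after the length change.
        bands.append(np.fft.irfft(shifted, n=m) * (m / n))
    return bands


def toy_encoder(band, dim=32):
    """Hypothetical stand-in for a pre-trained 16 kHz audio encoder.

    Returns a fixed-size vector of log-compressed, pooled spectral magnitudes.
    """
    mag = np.abs(np.fft.rfft(band))
    pooled = np.array([part.mean() for part in np.array_split(mag, dim)])
    return np.log1p(pooled)


def fuse_by_concatenation(embeddings):
    """One simple fusion choice (assumed here): concatenate band embeddings."""
    return np.concatenate(embeddings)
```

For a recording sampled at 64 kHz, `split_into_bands` returns four band signals (0–8, 8–16, 16–24, and 24–32 kHz); encoding each with `toy_encoder` and passing the results to `fuse_by_concatenation` gives a single 128-dimensional vector. A baseband-only system would keep just the first of those four embeddings.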