🤖 AI Summary
This study addresses a limitation of current bioacoustic benchmarks: most rely on a linear probe applied to the final layer of an audio encoder, which may underestimate model performance by ignoring the interaction between the encoder and the probing head. The authors systematically evaluate probing strategies on the BEANs and BirdSet benchmarks, combining last-layer or multi-layer features with linear or attention-based probes. They propose multi-layer attention probes as a more comprehensive way to assess representation quality. In their experiments, multi-layer probing improves downstream task performance across all tested models, and attention-based probes outperform linear probes for Transformer-based encoders, suggesting that prevailing last-layer protocols may misrepresent encoder quality.
📝 Abstract
Probing heads map the representations that a machine learning model learns from audio to downstream task labels, and they are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study probing strategies on two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last-layer and multi-layer probing with both linear and attention probes. We show that larger probe heads that leverage time information perform better. Our results suggest that current benchmarks may misrepresent encoder quality when they rely on a last-layer probing setup: multi-layer probing improves downstream task performance across all tested models, and attention probing outperforms linear probing for Transformer models.
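To make the four probing setups concrete, the following is a minimal, dependency-free sketch of how they might be wired together. It is not the authors' implementation; the abstract does not specify the probe architectures. The sketch assumes that multi-layer probing is a softmax-weighted sum over encoder layers, that the attention probe is single-query attention pooling over time, and that the linear probe mean-pools over time first. All function names, parameter names, and sizes are hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] for equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def mean_pool(frames):
    """Average over time; discards temporal structure (assumed linear-probe pooling)."""
    return weighted_sum([1.0 / len(frames)] * len(frames), frames)

def attention_pool(frames, query):
    """Single-query attention over time frames (assumed attention-probe pooling)."""
    scale = math.sqrt(len(query))
    scores = [dot(frame, query) / scale for frame in frames]
    return weighted_sum(softmax(scores), frames)

def fuse_layers(layers, layer_logits):
    """Combine [layer][time][dim] features into [time][dim] with softmax layer weights."""
    weights = softmax(layer_logits)
    n_frames = len(layers[0])
    return [weighted_sum(weights, [layer[t] for layer in layers])
            for t in range(n_frames)]

def linear_head(x, W, b):
    """Map a pooled embedding to class logits."""
    return [dot(row, x) + bias for row, bias in zip(W, b)]

def probe(layers, params, multi_layer=True, attention=True):
    """Run one of the four probe variants on frozen encoder features."""
    frames = fuse_layers(layers, params["layer_logits"]) if multi_layer else layers[-1]
    pooled = attention_pool(frames, params["query"]) if attention else mean_pool(frames)
    return linear_head(pooled, params["W"], params["b"])

# Toy stand-in for frozen encoder outputs: 3 layers, 5 time frames, 4 dims, 2 classes.
random.seed(0)
N_LAYERS, N_FRAMES, DIM, N_CLASSES = 3, 5, 4, 2
features = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_FRAMES)]
            for _ in range(N_LAYERS)]
params = {
    "layer_logits": [0.0] * N_LAYERS,  # trainable in practice; uniform at init
    "query": [random.gauss(0, 1) for _ in range(DIM)],
    "W": [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_CLASSES)],
    "b": [0.0] * N_CLASSES,
}

for multi_layer in (False, True):
    for attention in (False, True):
        logits = probe(features, params, multi_layer, attention)
        print(f"multi_layer={multi_layer} attention={attention} logits={logits}")
```

In this sketch only `params` would be trained while the encoder stays frozen. The last-layer linear variant uses neither the intermediate layers nor the ordering of frames, which is the kind of information the larger probes can exploit.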