Distilling Spectrograms into Tokens: Fast and Lightweight Bioacoustic Classification for BirdCLEF+ 2025

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The BirdCLEF+ 2025 challenge imposes two stringent constraints: a 90-minute CPU-only inference limit and fine-grained classification of 206 species (birds, mammals, insects, and amphibians) from soundscape recordings. Method: We propose a lightweight sequence-modeling pipeline, Spectrogram Token Skip-Gram (STSG). Mel-spectrograms are first clustered into discrete "spectrogram tokens" with Faiss K-means; a Word2Vec skip-gram model then learns contextual embeddings for those tokens without supervision; for classification, token embeddings within each 5-second window are averaged and passed to a linear classifier. This avoids running computationally expensive end-to-end deep networks at inference time. Contribution/Results: As a baseline, our TFLite-optimized Perch model runs in approximately 16 minutes on CPU and reaches a ROC-AUC of 0.729 on the public leaderboard (0.711 private). The STSG pipeline has a projected inference time of 6 minutes for a 700-minute test set and reaches a ROC-AUC of 0.559 public and 0.520 private, indicating that fast tokenization with static embeddings is a viable, resource-efficient approach to bioacoustic classification.

📝 Abstract
The BirdCLEF+ 2025 challenge requires classifying 206 species, including birds, mammals, insects, and amphibians, from soundscape recordings under a strict 90-minute CPU-only inference deadline, making many state-of-the-art deep learning approaches impractical. To address this constraint, the DS@GT BirdCLEF team explored two strategies. First, we establish competitive baselines by optimizing pre-trained models from the Bioacoustics Model Zoo for CPU inference. Using TFLite, we achieved a nearly 10x inference speedup for the Perch model, enabling it to run in approximately 16 minutes and achieve a final ROC-AUC score of 0.729 on the public leaderboard post-competition and 0.711 on the private leaderboard. The best model from the zoo was BirdSetEfficientNetB1, with a public score of 0.810 and a private score of 0.778. Second, we introduce a novel, lightweight pipeline named Spectrogram Token Skip-Gram (STSG) that treats bioacoustics as a sequence modeling task. This method converts audio into discrete "spectrogram tokens" by clustering Mel-spectrograms using Faiss K-means and then learns high-quality contextual embeddings for these tokens in an unsupervised manner with a Word2Vec skip-gram model. For classification, embeddings within a 5-second window are averaged and passed to a linear model. With a projected inference time of 6 minutes for a 700-minute test set, the STSG approach achieved a final ROC-AUC public score of 0.559 and a private score of 0.520, demonstrating the viability of fast tokenization approaches with static embeddings for bioacoustic classification. Supporting code for this paper can be found at https://github.com/dsgt-arc/birdclef-2025.
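The inference path of the STSG pipeline described in the abstract (tokenize, look up static embeddings, average per window, apply a linear model) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: random arrays stand in for the Faiss K-means centroids, the Word2Vec vectors, and the trained linear weights, and the function names, the one-token-per-frame granularity, the frame rate, and the sigmoid output are all assumptions not stated in the source.

```python
import numpy as np

def assign_tokens(frames, centroids):
    """Map each frame (n_frames, n_mels) to the index of its nearest centroid."""
    # Squared Euclidean distance from every frame to every centroid: (n_frames, k)
    d2 = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def window_embeddings(tokens, embedding_table, frames_per_window):
    """Mean-pool static token embeddings over consecutive, non-overlapping windows."""
    n_windows = len(tokens) // frames_per_window
    trimmed = tokens[: n_windows * frames_per_window]  # drop any trailing partial window
    vecs = embedding_table[trimmed]                    # (n_windows * fpw, dim)
    return vecs.reshape(n_windows, frames_per_window, -1).mean(axis=1)

def predict_scores(window_vecs, weights, bias):
    """Linear model followed by a sigmoid (assumed here for multi-label scores)."""
    logits = window_vecs @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
n_mels, vocab_size, dim, n_classes = 16, 32, 8, 206
frames_per_window = 50  # hypothetical: 10 frames per second x 5 seconds

frames = rng.normal(size=(200, n_mels))               # stand-in for Mel-spectrogram frames
centroids = rng.normal(size=(vocab_size, n_mels))     # stand-in for K-means centroids
embedding_table = rng.normal(size=(vocab_size, dim))  # stand-in for Word2Vec vectors
weights = rng.normal(size=(dim, n_classes))           # stand-in for trained linear weights
bias = np.zeros(n_classes)

tokens = assign_tokens(frames, centroids)
window_vecs = window_embeddings(tokens, embedding_table, frames_per_window)
scores = predict_scores(window_vecs, weights, bias)
print(tokens.shape, window_vecs.shape, scores.shape)  # (200,) (4, 8) (4, 206)
```

Because the embeddings are static, inference in this sketch reduces to a nearest-centroid search, a table lookup, a mean, and one matrix multiplication per window, which is consistent with the short projected runtime the abstract reports.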
Problem

Research questions and friction points this paper is trying to address.

Classify 206 species under 90-minute CPU-only constraint
Optimize pre-trained models for fast bioacoustic inference
Introduce lightweight spectrogram tokenization for sequence modeling
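To make the time constraint concrete, the short calculation below converts the figures quoted in the abstract into a per-window compute budget. It assumes that the 700-minute test-set length given for the STSG projection also applies to the 90-minute limit, and that the audio is split into non-overlapping 5-second windows; the function and constant names are illustrative.

```python
def realtime_factor(audio_minutes, compute_minutes):
    """How many minutes of audio are processed per minute of compute."""
    return audio_minutes / compute_minutes

TEST_SET_MINUTES = 700   # test-set length quoted for the STSG projection
LIMIT_MINUTES = 90       # competition CPU inference limit
STSG_MINUTES = 6         # projected STSG inference time
WINDOW_SECONDS = 5       # classification window length

# Number of non-overlapping 5-second windows in the test set
n_windows = TEST_SET_MINUTES * 60 // WINDOW_SECONDS

# Minimum speed needed to finish within the limit, and the projected STSG speed
required_factor = realtime_factor(TEST_SET_MINUTES, LIMIT_MINUTES)
stsg_factor = realtime_factor(TEST_SET_MINUTES, STSG_MINUTES)

# Compute time available per window, in milliseconds
budget_ms = LIMIT_MINUTES * 60 * 1000 / n_windows
stsg_ms = STSG_MINUTES * 60 * 1000 / n_windows

print(n_windows)                  # 8400
print(round(required_factor, 2))  # 7.78
print(round(stsg_factor, 1))      # 116.7
print(round(budget_ms, 1), round(stsg_ms, 1))  # 642.9 42.9
```

Under these assumptions, any model must process audio at roughly 7.8 times real time to finish, while the projected STSG figure corresponds to well over 100 times real time.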
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized pre-trained models for CPU inference
Introduced Spectrogram Token Skip-Gram pipeline
Used unsupervised embeddings for bioacoustic classification