Towards Leveraging Sequential Structure in Animal Vocalizations

📅 2025-11-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Computational bioacoustics methods often average frame-level features over time, which discards the order of sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences can retain that sequential information. Continuous representations are extracted with HuBERT, a self-supervised speech model, and discretized with either vector quantization or Gumbel-softmax vector quantization to yield token sequences. Pairwise distance analysis across four bioacoustics datasets shows that these sequences discriminate call-types and callers, and a k-nearest-neighbour classifier using Levenshtein distance achieves reasonable call-type and caller classification performance. The results suggest that discrete token sequences are a promising alternative feature representation for analysing sequential structure in animal vocalizations.

📝 Abstract

Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and Gumbel-softmax vector quantization of representations extracted from a self-supervised speech model, can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performance, and hold promise as alternative feature representations towards leveraging sequential information in animal vocalizations.
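
The classification step described in the abstract can be illustrated with a minimal sketch: a k-nearest-neighbour vote over token sequences, using Levenshtein (edit) distance as the metric. The function names, the toy token sequences, the labels, and the choice of k = 3 are all assumptions made for illustration; they are not taken from the paper, which does not specify its implementation here.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if tokens match
            ))
        prev = curr
    return prev[-1]


def knn_predict(query, train_seqs, train_labels, k=3):
    """Majority label among the k training sequences closest to the query."""
    ranked = sorted((levenshtein(query, s), i) for i, s in enumerate(train_seqs))
    votes = Counter(train_labels[i] for _, i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical token sequences and call-type labels, for illustration only.
train_seqs = [[5, 5, 9, 2], [5, 9, 9, 2], [7, 1, 1, 4], [7, 1, 4, 4]]
train_labels = ["bark", "bark", "whine", "whine"]

predicted = knn_predict([5, 9, 2], train_seqs, train_labels, k=3)
print(predicted)
```

Because edit distance compares sequences position by position, two vocalizations with the same tokens in a different order are treated as different, which is exactly the information that temporal averaging discards.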

Problem

Research questions and friction points this paper is trying to address.

Capturing temporal sequential structures in animal vocalizations

Using discrete token sequences for acoustic pattern recognition

Developing alternative representations to leverage sequential information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses discrete acoustic token sequences

Applies vector quantization to representations

Leverages sequential structure in vocalizations
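
As a rough illustration of how vector quantization turns frame-level representations into a token sequence, the sketch below assigns each frame to the index of its nearest codebook vector. The two-dimensional frames and the three-entry codebook are hypothetical toy values; real HuBERT embeddings are much higher-dimensional, and the paper's codebook is learned rather than hand-written.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector."""
    def sq_dist(u, v):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Hypothetical codebook and frame embeddings, for illustration only.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 1.2)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 1, 2]

# Reversing the frames reverses the token sequence, whereas the mean of the
# frames would be unchanged: the token sequence keeps the temporal order.
print(quantize(frames[::-1], codebook))  # [2, 1, 1, 0]
```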
