🤖 AI Summary
Most computational bioacoustics methods average frame-level features over time, which discards the sequential structure of animal vocalizations. This work instead models vocalizations as discrete acoustic token sequences: HuBERT extracts continuous frame-level representations, which are discretized by vector quantization or Gumbel-Softmax vector quantization into temporally ordered token sequences. A k-nearest neighbour classifier with Levenshtein distance then operates on whole sequences. Across four bioacoustics datasets, pairwise distance analysis shows that the token sequences discriminate call-types and callers, and the classifier reaches reasonable call-type and caller classification performance. The results suggest that discrete token sequences are a promising alternative feature representation for exploiting sequential information in animal vocalizations.
📝 Abstract
Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived by applying vector quantization and Gumbel-Softmax vector quantization to representations extracted from self-supervised speech models, can effectively capture and leverage temporal information. Pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performance, and hold promise as an alternative feature representation for leveraging sequential information in animal vocalizations.
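
To make the sequence classification step concrete, the following is a minimal sketch of $k$-Nearest Neighbour classification with Levenshtein distance over token sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names `levenshtein` and `knn_predict`, the integer token IDs, the labels, and the choice of `k=3` with majority voting are all hypothetical, and the toy sequences stand in for tokens that would come from quantized HuBERT embeddings.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two token sequences.

    Insertions, deletions, and substitutions each cost 1.
    """
    # Keep the shorter sequence in the inner loop to limit row length.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if tokens match)
            ))
        prev = curr
    return prev[-1]


def knn_predict(query, train_seqs, train_labels, k=3):
    """Majority vote among the k training sequences closest to `query`."""
    ranked = sorted(
        (levenshtein(query, seq), idx) for idx, seq in enumerate(train_seqs)
    )
    votes = Counter(train_labels[idx] for _, idx in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each list is the token sequence of one vocalization.
train_seqs = [[3, 3, 7, 7, 1], [3, 7, 7, 1], [5, 5, 2, 2, 2], [5, 2, 2, 9]]
train_labels = ["call_A", "call_A", "call_B", "call_B"]
query = [3, 3, 7, 1]

predicted = knn_predict(query, train_seqs, train_labels, k=3)
print(predicted)
```

Because the distance is computed on ordered tokens, two vocalizations with the same tokens in a different order are not treated as identical, which is the information that temporal averaging of frame-level features would lose.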