Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation

📅 2026-01-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses cross-modal bird species retrieval in the absence of paired audio-image data, a setting where representation alignment is inherently difficult. The authors use textual semantics as an intermediary: they distill the text embedding space of a pretrained image-text model (BioCLIP-2) into an audio-text model (BioLingual) by contrastively fine-tuning its audio encoder. This induces implicit alignment between audio and image embeddings without any direct audio-image supervision. By circumventing the conventional reliance on paired multimodal data, the method significantly outperforms zero-shot model ensembles and text-mapping baselines on bioacoustic benchmarks such as SSW60, while preserving audio discriminability and improving audio-text alignment.
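At inference time, the alignment described above allows retrieval by plain cosine similarity between the distilled audio embedding and BioCLIP-2 image embeddings. The paper does not include code; the sketch below is a hypothetical illustration of that retrieval step, with the function name, gallery layout, and `top_k` parameter being assumptions for the example.

```python
import numpy as np

def retrieve_images(audio_emb, image_embs, top_k=5):
    """Rank a gallery of image embeddings by cosine similarity to one
    audio query embedding. This only works because distillation placed
    both modalities in a shared, text-anchored embedding space."""
    a = audio_emb / np.linalg.norm(audio_emb)
    g = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = g @ a                       # cosine similarity per gallery image
    order = np.argsort(-sims)[:top_k]  # indices of the top_k best matches
    return order, sims[order]
```

The returned indices can then be mapped to species labels, giving the interpretable, visually grounded prediction the summary refers to.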

πŸ“ Abstract
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
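The distillation objective in the abstract pairs each training clip's audio embedding with the frozen BioCLIP-2 text embedding of its species name under a contrastive loss. As a minimal sketch, assuming a symmetric InfoNCE formulation with an illustrative temperature of 0.07 (the paper's exact loss and hyperparameters are not given here):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize the exponentials
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def distillation_loss(audio_emb, teacher_text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of audio_emb (student audio encoder
    output) should match row i of teacher_text_emb (frozen teacher text
    embedding of the same species); all other rows act as negatives."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(teacher_text_emb)
    logits = (a @ t.T) / temperature          # (N, N) scaled cosine similarities
    idx = np.arange(len(a))                   # positives lie on the diagonal
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2a = -log_softmax(logits.T, axis=1)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

Minimizing this pulls audio embeddings toward the teacher's text space; since BioCLIP-2's image embeddings already live in that space, audio-image alignment emerges without any images in training, which is the emergent alignment the abstract claims.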
Problem

Research questions and friction points this paper is trying to address.

audio-to-image retrieval
bioacoustic species recognition
audio-image alignment
data scarcity
cross-modal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

text distillation
audio-to-image retrieval
cross-modal alignment
bioacoustic species recognition
contrastive learning