WavLink: Compact Audio-Text Embeddings with a Global Whisper Token

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a compact and efficient audio–text embedding approach that effectively leverages the Whisper encoder, addressing the limitations of existing methods which underutilize Whisper and suffer from high-dimensional embeddings and low efficiency. The method introduces a learnable global token into the Whisper audio encoder and jointly trains it with a text encoder, employing a two-stage training strategy combined with Matryoshka loss. This is the first framework to successfully integrate Whisper into audio–text embedding learning. The resulting model achieves state-of-the-art performance while enabling an 8× compression of embedding dimensions, delivering strong results on audio–text retrieval, multiple-choice question answering on AIR-Bench, and zero-shot classification tasks. Notably, it substantially reduces storage and computational costs with minimal performance degradation.
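
The core idea above, a single learnable global token whose output serves as the clip embedding in place of Whisper's ~1500 frame features, can be illustrated with a minimal NumPy sketch. This is a conceptual approximation, not the paper's architecture: the function name, the single cross-attention step, and the projection matrices are all illustrative assumptions (in the actual model the token is processed jointly with the frames inside the Whisper encoder and trained end-to-end).

```python
import numpy as np

def global_token_pool(frames, g, W_q, W_k, W_v):
    """Pool T frame features into one clip embedding by letting a
    learnable global token attend over the frames.
    Conceptual sketch only -- not the paper's exact architecture."""
    q = g @ W_q                        # (d,) query from the global token
    K = frames @ W_k                   # (T, d) keys from frame features
    V = frames @ W_v                   # (T, d) values
    scores = K @ q / np.sqrt(len(q))   # (T,) attention logits
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax weights over frames
    return w @ V                       # (d,) compact clip embedding

rng = np.random.default_rng(0)
d, T = 8, 1500                         # Whisper yields ~1500 frames per 30 s clip
frames = rng.standard_normal((T, d))   # stand-in for Whisper encoder outputs
g = rng.standard_normal(d)             # the global token (learned during training)
W = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]
emb = global_token_pool(frames, g, *W)
```

The point of the sketch is the interface: 1500 frame vectors go in, one fixed-size embedding comes out, which is what makes the representation cheap to store and compare.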

📝 Abstract
Whisper has become the de-facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, CLAP-style audio-text embedding models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST) and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x smaller embeddings with minimal performance drop. WavLink also demonstrates competitive performance on AIR-Bench MCQs and zero-shot classification.
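
The Matryoshka-style supervision mentioned in the abstract can be sketched as a contrastive loss summed over nested prefix dimensions of the paired embeddings, so that truncated prefixes (e.g. an 8x smaller slice) remain useful retrieval embeddings on their own. A minimal NumPy sketch, with the dimension schedule, temperature, and function names as illustrative assumptions rather than the paper's published values:

```python
import numpy as np

def info_nce(A, T, temp=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    audio/text embeddings; row i of A matches row i of T."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = A @ T.T / temp
    n = len(A)
    def ce(L):  # cross-entropy with the diagonal as positives
        L = L - L.max(axis=1, keepdims=True)
        p = np.exp(L) / np.exp(L).sum(axis=1, keepdims=True)
        return -np.log(p[np.arange(n), np.arange(n)]).mean()
    return 0.5 * (ce(logits) + ce(logits.T))  # audio->text and text->audio

def matryoshka_loss(audio, text, dims=(64, 128, 256, 512)):
    """Sum the contrastive loss over nested prefix dimensions, so a
    truncated 64-d prefix is trained to work as well as it can."""
    return sum(info_nce(audio[:, :d], text[:, :d]) for d in dims)
```

At inference one simply slices the stored embeddings to the chosen prefix length, which is what yields the reported storage and compute savings with minimal performance drop.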
Problem

Research questions and friction points this paper is trying to address.

audio-text embedding
Whisper encoder
compact representation
retrieval performance
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

WavLink
Whisper encoder
global token
audio-text embedding
Matryoshka representation
Gokul Karthik Kumar
Technology Innovation Institute, Abu Dhabi, UAE
Ludovick Lepauloux
Technology Innovation Institute, Abu Dhabi, UAE
Hakim Hacid
Technology Innovation Institute (TII), UAE
Machine Learning · LLM · Databases · Information Retrieval · Edge ML