BAT: Better Audio Transformer Guided by Convex Gated Probing

📅 2026-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a reliable evaluation mechanism for audio self-supervised learning models: fine-tuning can distort the quality of the learned embeddings, while conventional probing fails to exploit the models' full capability, so the resulting performance rankings are misleading. The authors propose Convex Gated Probing (CGP), a prototype-based probe that uses a gating mechanism to draw on all layers of the model while keeping its parameters frozen. CGP substantially narrows the performance gap between probing and fine-tuning and exposes where task-relevant information is located within self-supervised audio models. Guided by these insights, the authors rework data preprocessing, model architecture, and the pre-training recipe to develop the Better Audio Transformer (BAT), which sets a new state of the art on audio benchmarks including AudioSet.

📝 Abstract

Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.
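
To make the idea of gating over frozen layers concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not give CGP's formulation, so the softmax gate (one way to obtain non-negative weights that sum to one, i.e. a convex combination), the cosine-similarity prototype scoring, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Map gate logits to non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_gate(layer_embeddings, gate_logits):
    """Mix per-layer embeddings with convex weights (assumed gate form)."""
    weights = softmax(gate_logits)
    dim = len(layer_embeddings[0])
    mixed = [
        sum(w * h[d] for w, h in zip(weights, layer_embeddings))
        for d in range(dim)
    ]
    return mixed, weights

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def prototype_scores(embedding, prototypes):
    """Score an embedding against one prototype per class (assumed scoring)."""
    return {label: cosine(embedding, p) for label, p in prototypes.items()}

# Toy example: three frozen "layers" with 2-d embeddings (made-up values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_logits = [0.0, 2.0, 0.0]  # in practice these would be learned; fixed here
mixed, weights = convex_gate(layers, gate_logits)

# Hypothetical class prototypes; in a real probe these would be trained.
prototypes = {"speech": [1.0, 0.0], "music": [0.0, 1.0]}
scores = prototype_scores(mixed, prototypes)
predicted = max(scores, key=scores.get)
```

In this sketch only the gate logits and the prototypes would be trainable, and the backbone that produces the layer embeddings stays frozen. The gate weights can then be read directly as an indication of which layers the task draws on, which is one plausible reading of how a gated probe "exposes the location" of task-relevant information.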

Problem

Research questions and friction points this paper is trying to address.

audio self-supervised learning

probing

fine-tuning

embedding evaluation

AudioSet

Innovation

Methods, ideas, or system contributions that make the work stand out.

Convex Gated Probing

Better Audio Transformer

self-supervised learning