OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

📅 2025-07-18
🤖 AI Summary
Existing BEATs models suffer from limited generalizability due to reliance on closed-source pretraining code and training exclusively on the AudioSet dataset. To address this, we propose the first fully open-source, multi-domain audio pretraining framework that extends BEATs into a general-purpose audio encoder. The method uses masked token prediction as the pretraining objective and integrates diverse audio data, including bioacoustics and environmental sounds, to enhance representation capacity. The resulting model supports audio understanding, reasoning, and cross-modal applications. Evaluated on six bioacoustic, two environmental sound, and five audio reasoning benchmarks, it achieves state-of-the-art performance across all tasks, surpassing billion-parameter models with only one-quarter of their parameter count. This work establishes an efficient, general-purpose audio representation learned via multi-domain pretraining, improving both model generalizability and practical utility.

📝 Abstract
Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application to general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty-five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets, and five reasoning datasets, outperforming models that exceed a billion parameters at one-fourth their size. These results demonstrate the effectiveness of multi-domain datasets and the masked token prediction task for learning general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pre-trained and fine-tuned checkpoints, and training logs at https://shikhar-s.github.io/OpenBEATs
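To make the masked-token-prediction objective concrete, here is a minimal, illustrative sketch of the masking step it relies on: a fraction of discrete audio token ids is replaced with a mask id, and the pre-training loss is computed only at those masked positions. This is not the authors' code; the function name, mask ratio, and token ids below are hypothetical.

```python
import random

def mask_tokens(tokens, mask_ratio=0.5, mask_id=-1, seed=0):
    """Replace a random fraction of token ids with mask_id.

    Returns the corrupted sequence and the sorted list of masked
    positions; the model is trained to predict the original ids
    only at those positions.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = mask_id
    return corrupted, sorted(positions)

# Toy sequence of discrete audio token ids (hypothetical values).
tokens = [3, 7, 1, 4, 4, 2, 9, 5]
corrupted, masked = mask_tokens(tokens, mask_ratio=0.5)

# Prediction targets are the original ids at the masked positions.
targets = [tokens[p] for p in masked]
```

In BEATs-style pre-training, `tokens` would come from an acoustic tokenizer over AudioSet (or, in OpenBEATs, multi-domain audio), and an encoder would be trained to recover `targets` from `corrupted`.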
Problem

Research questions and friction points this paper is trying to address.

Lack of open-source general-purpose audio encoder models
Limited application of masked token prediction in audio understanding
Restricted downstream performance due to single-domain pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source framework for general audio encoding
Multi-domain audio pre-training with masked tokens
State-of-the-art performance with compact model size