🤖 AI Summary
This work addresses the limited cross-domain representation capability of general-purpose audio encoders in handling speech, music, and environmental sounds. Building upon the BEATs architecture, the authors conduct domain-mixed pretraining using 74,000 hours of multi-source audio data and propose an efficient model ensemble method that integrates Dasheng 1.2B with two extended BEATs variants. By systematically comparing speech-dominant and balanced data mixing strategies, the ensemble preserves the strengths of individual submodels while significantly enhancing overall performance. The resulting system outperforms both the official baseline and Dasheng 1.2B in the ICME 2025 Audio Encoding Challenge, and the associated models have been publicly released on the Hugging Face platform.
📝 Abstract
This technical report describes our submission to the ICME 2025 audio encoder challenge. Our submitted system is built on BEATs, an audio encoder based on masked speech token prediction. We extend the BEATs model using 74,000 hours of data drawn from various speech, music, and sound corpora and scale its architecture up to 300 million parameters. We experiment with speech-heavy and balanced pre-training mixtures to study the impact of different domains on final performance. Our submitted system is an ensemble of the Dasheng 1.2B model with two custom scaled-up BEATs models trained on the aforementioned pre-training data mixtures. We also propose a simple ensembling technique that retains the best capabilities of the constituent models and surpasses both the baseline and Dasheng 1.2B. For open science, we publicly release our trained checkpoints via Hugging Face at https://huggingface.co/shikhar7ssu/OpenBEATs-ICME-SOUND and https://huggingface.co/shikhar7ssu/OpenBEATs-ICME.
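The abstract does not spell out the ensembling method, but one common lightweight way to combine frozen audio encoders is to concatenate their (normalized) clip-level embeddings before the downstream evaluation head. The sketch below illustrates that general idea only; the embedding dimensions, model names, and the normalization step are all assumptions, not details from the report.

```python
import numpy as np

def ensemble_embed(per_model_embeddings):
    """Concatenate L2-normalized clip embeddings from several encoders.

    per_model_embeddings: list of 1-D arrays, one per model.
    Normalizing each embedding first keeps any single encoder from
    dominating the fused representation purely by output scale.
    """
    normed = [e / (np.linalg.norm(e) + 1e-12) for e in per_model_embeddings]
    return np.concatenate(normed)

# Hypothetical clip-level outputs standing in for the three encoders;
# the dimensions are illustrative, not taken from the report.
dasheng_emb = np.random.randn(1536)   # assumed Dasheng 1.2B embedding
beats_a_emb = np.random.randn(1024)   # assumed speech-heavy BEATs variant
beats_b_emb = np.random.randn(1024)   # assumed balanced-mix BEATs variant

fused = ensemble_embed([dasheng_emb, beats_a_emb, beats_b_emb])
print(fused.shape)  # (3584,)
```

A nice property of concatenation is that a linear probe on the fused vector can fall back to any single submodel's features, which is one way an ensemble can "retain the best capabilities of the constituent models."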