Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

📅 2026-05-20

🤖 AI Summary
Text-to-music generation research has long relied on proprietary data and industrial-scale computational resources, which has hindered the establishment of fair and open academic benchmarks. To bridge this gap, we launch the ICME 2026 Text-to-Music Generation Challenge, built on a CC-licensed instrumental subset of MTG-Jamendo, with two tracks (one focused on efficiency, one on performance) and the requirement that participants train models from scratch. The challenge establishes a standardized benchmark tailored for the academic community, introduces a novel Concept Coverage Score (CCS), and releases open-source baseline models, preprocessing pipelines, and evaluation code. The evaluation framework combines Fréchet Audio Distance, CLAP score, CCS, and subjective listening tests into a reproducible, multi-dimensional assessment intended to lower the barrier to entry for research in this domain.
📝 Abstract
This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite the rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietary datasets with industrial-scale computational resources, creating a significant barrier for academic research. To address this, the ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch using a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. The challenge is divided into two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions are evaluated through a multi-stage process involving objective metrics, including Fréchet Audio Distance (FAD), CLAP score, and a novel Concept Coverage Score (CCS), followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for computing FAD and CLAP, this challenge aims to facilitate and promote TTM research in academic contexts.
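To make the two reproducible objective metrics concrete, here is a minimal sketch of how FAD and a CLAP-style score are commonly computed from precomputed embeddings. It is not the challenge's official evaluation code. The function names, array shapes, and the choice of embedding model are assumptions for illustration, and the CLAP-style score is taken to be the mean cosine similarity between paired text and audio embeddings, which may differ from the challenge's exact definition. CCS is omitted because the abstract does not define it.

```python
import numpy as np


def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input has shape (num_clips, dim) with dim >= 2; rows are per-clip
    embeddings from some audio encoder (hypothetical here).
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def mean_cosine_similarity(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Mean cosine similarity between row-aligned text and audio embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(t * a, axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 4))        # stand-in for reference-set embeddings
    generated = reference + rng.normal(scale=0.1, size=(200, 4))
    print("FAD-style distance:", frechet_distance(reference, generated))
    print("CLAP-style score:", mean_cosine_similarity(reference, generated))
```

Lower Fréchet distance indicates that generated audio is statistically closer to the reference set, while a higher CLAP-style score indicates closer agreement between each caption and its generated clip.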
Problem

Research questions and friction points this paper is trying to address.

text-to-music generation
academic research barrier
fair-play benchmark
open dataset
generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-Music Generation
Academic Benchmark
Concept Coverage Score
Open-source Evaluation
Resource-constrained Training