CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of effective evaluation mechanisms for music generation models that handle Compositional Multimodal Instructions (CMIs), i.e., combinations of text, lyrics, and reference audio. It proposes the first reward modeling framework tailored to CMI-based music generation, introducing a large-scale preference dataset that combines human annotations with pseudo-labels, along with a unified multidimensional benchmark covering musicality, text-music alignment, and compositional instruction alignment. The proposed CMI-RM uses a parameter-efficient architecture that fuses heterogeneous input modalities and supports inference-time scaling via top-k filtering. Experiments show that CMI-RM's scores correlate strongly with human judgments. All data, models, and evaluation benchmarks are publicly released to support future research.
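Though the summary does not show a training objective, reward models learned from pairwise preference data typically optimize a Bradley-Terry-style loss. A minimal PyTorch sketch under that assumption, where `reward_model`, `instruction`, and the audio tensors are hypothetical placeholders rather than the released API:

```python
import torch.nn.functional as F

def preference_loss(reward_model, instruction, audio_chosen, audio_rejected):
    """Bradley-Terry pairwise loss: push the reward of the preferred
    sample above the reward of the rejected one."""
    r_chosen = reward_model(instruction, audio_chosen)    # scalar reward per pair
    r_rejected = reward_model(instruction, audio_rejected)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```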

📝 Abstract
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment, using CMI-Pref alongside previously released datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. All training data, benchmarks, and reward models are publicly available.
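The abstract's "inference-time scaling via top-k filtering" presumably amounts to best-of-n selection: sample several candidate generations, score them with the reward model, and keep only the top-scoring ones. A minimal sketch under that reading, with `generate` and `reward_model` as hypothetical callables rather than the paper's actual interface:

```python
def top_k_filter(generate, reward_model, instruction, n_candidates=16, k=4):
    """Best-of-n selection: draw n candidates from the music generator,
    score each with the reward model, and return the k highest-scoring."""
    candidates = [generate(instruction) for _ in range(n_candidates)]
    ranked = sorted(candidates,
                    key=lambda audio: reward_model(instruction, audio),
                    reverse=True)
    return ranked[:k]
```

Spending more compute on candidates and letting the reward model filter them is what makes this an inference-time scaling method: output quality improves with n without retraining the generator.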
Problem

Research questions and friction points this paper is trying to address.

music generation
reward modeling
multimodal instruction
evaluation benchmark
preference alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compositional Multimodal Instruction
Music Reward Model
Preference Dataset
Unified Benchmark
Parameter-Efficient Reward Modeling
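The "Parameter-Efficient Reward Modeling" tag suggests a design in which large pretrained modality encoders stay frozen and only a small fusion head is trained. A speculative sketch of one such architecture; the encoder dimensions and layer choices below are illustrative assumptions, not the paper's actual design:

```python
import torch
import torch.nn as nn

class FusionRewardHead(nn.Module):
    """Small trainable head over frozen text/audio encoders:
    only the projections and the scorer receive gradients."""
    def __init__(self, text_dim=768, audio_dim=512, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_emb, audio_emb):
        # Fuse the projected modality embeddings and map to a scalar reward.
        fused = torch.cat([self.text_proj(text_emb),
                           self.audio_proj(audio_emb)], dim=-1)
        return self.scorer(fused).squeeze(-1)
```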