HeartMuLa: A Family of Open Sourced Music Foundation Models

📅 2026-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of open-source, scalable, and multimodally controllable music foundation models of high quality. The authors present an open-source framework comprising a low-frame-rate (12.5 Hz) yet high-fidelity audio codec, an audio–text contrastive alignment module, and a robust lyric recognition component. Built on an LLM scaled to 7B parameters, the system generates songs autoregressively and accepts natural-language prompts at the level of individual song sections. It supports fine-grained stylistic control and short-video soundtrack synthesis, and the authors report strong performance across multiple tasks. They further report that the generated audio approaches the quality of commercial systems such as Suno, and claim this is the first demonstration that commercial-grade results can be reproduced with academic-scale data and GPU resources. The work gives the research community a strong baseline and practical tools for controllable music generation.
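To make the idea of audio–text contrastive alignment concrete, the following is a minimal sketch of a generic, CLAP-style symmetric contrastive loss in plain Python. It is not taken from the paper: the function names, the temperature of 0.07, and the toy two-dimensional embeddings are all illustrative assumptions, and the actual training objective of the alignment module may differ.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _mean_diagonal_cross_entropy(logits):
    """Average cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Generic symmetric contrastive loss over a batch of paired embeddings.

    audio_embs[i] and text_embs[i] are assumed to describe the same clip.
    The temperature value is an illustrative assumption, not from the paper.
    """
    audio = [_normalize(v) for v in audio_embs]
    text = [_normalize(v) for v in text_embs]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * t for a, t in zip(a_vec, t_vec)) / temperature for t_vec in text]
        for a_vec in audio
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    audio_to_text = _mean_diagonal_cross_entropy(logits)
    text_to_audio = _mean_diagonal_cross_entropy(logits_transposed)
    return 0.5 * (audio_to_text + text_to_audio)


# Toy batch: matched captions should score a lower loss than swapped ones.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[0.9, 0.1], [0.1, 0.9]]
captions_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = symmetric_contrastive_loss(audio_batch, captions_matched)
loss_swapped = symmetric_contrastive_loss(audio_batch, captions_swapped)
print(loss_matched < loss_swapped)
```

The loss is low when each audio embedding is most similar to its own caption and high when the pairing is scrambled, which is the property a contrastive alignment model is trained to exploit.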

📝 Abstract

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, HeartMuLa provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural language prompts; and (ii) short, engaging music generation, suitable as background music for short videos. Lastly, HeartMuLa improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.
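A quick back-of-the-envelope calculation shows why a 12.5 Hz frame rate helps autoregressive modeling: the number of codec frames the language model must step through grows linearly with frame rate. The sketch below only counts frames; the helper name, the example durations, and the 50 Hz comparison rate are illustrative assumptions, and the number of tokens per frame is not stated in the abstract, so it is left out.

```python
import math


def frames_for_duration(seconds: float, frame_rate_hz: float = 12.5) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


# Example durations are arbitrary; 50 Hz is a hypothetical higher-rate codec
# used only for comparison, not a figure from the paper.
for label, seconds in [("30 s clip", 30), ("3 min song", 180), ("4 min song", 240)]:
    low_rate = frames_for_duration(seconds)
    high_rate = frames_for_duration(seconds, frame_rate_hz=50.0)
    print(f"{label}: {low_rate} frames at 12.5 Hz vs {high_rate} at 50 Hz")
```

At 12.5 Hz a three-minute song spans 2,250 frames, a quarter of what the hypothetical 50 Hz codec would need, which keeps full-song sequences within reach of a standard autoregressive context window.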

Problem

Research questions and friction points this paper is trying to address.

Music Foundation Models

Open Source

Multimodal Music Generation

Audio-Text Alignment

High-Fidelity Music Synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Music Foundation Models

High-Fidelity Music Codec

LLM-based Music Generation