Exploring Adapter Design Tradeoffs for Low Resource Music Generation

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of adapting large music generation models to low-resource musical traditions—specifically Hindustani classical and Turkish makam—where labeled data and domain expertise are scarce. We investigate architectural design, insertion strategy, and capacity trade-offs for parameter-efficient fine-tuning via adapters. We propose a dual-adapter framework: a convolutional adapter to capture fine-grained melodic ornamentation, and a Transformer-based adapter to preserve long-range improvisational structure—marking the first systematic validation of convolutional adapters for granular music feature modeling. Experiments identify a 40M-parameter medium-scale adapter as optimal, balancing representational capacity and generation fidelity. Evaluations on both autoregressive (MusicGen) and diffusion-based (Mustango) models demonstrate accelerated training, improved pitch-rhythm stability, enhanced prompt adherence, >95% reduction in trainable parameters, and performance approaching full-parameter fine-tuning.

📝 Abstract
Fine-tuning large-scale music generation models, such as MusicGen and Mustango, is computationally expensive, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which combinations produce optimal adapters, and why, for a given low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel at capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve the long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs with better adherence to the input prompt, but falls short in note stability, rhythm alignment, and aesthetics; it is also computationally intensive and takes significantly longer to train. In contrast, autoregressive models like MusicGen train faster, are more efficient, and produce higher-quality output in comparison, but show slightly higher redundancy in their generations.
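The ">95% reduction in trainable parameters" follows from simple bottleneck-adapter arithmetic: only the small down- and up-projections are trained while the backbone stays frozen. A back-of-the-envelope sketch (all sizes below are illustrative assumptions, not figures taken from the paper):

```python
def trainable_fraction(d_model, n_layers, bottleneck, backbone_params):
    """Count trainable parameters for bottleneck adapters.

    Each adapter contributes a down-projection (d_model x bottleneck)
    and an up-projection (bottleneck x d_model); one adapter per layer.
    Biases and norms are ignored for this rough estimate.
    """
    adapter_params = n_layers * 2 * d_model * bottleneck
    return adapter_params, adapter_params / backbone_params

# Hypothetical ~1.5B-parameter backbone, 48 layers, d_model=2048,
# bottleneck width 256: the adapters total ~50M parameters,
# i.e. only ~3.4% of the backbone is trained (a >96% reduction),
# in the same ballpark as the paper's 40M mid-sized adapter.
params, frac = trainable_fraction(2048, 48, 256, 1_500_000_000)
```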
Problem

Research questions and friction points this paper is trying to address.

Optimizing adapter design for low-resource music generation models
Comparing performance of convolution vs transformer-based adapters
Balancing computational efficiency and output quality in PEFT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapter-based PEFT for efficient music generation
Convolution adapters capture local musical details
Mid-sized adapters balance expressivity and quality
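The convolutional-adapter idea above can be sketched minimally: a bottleneck down-projection, a depthwise 1-D convolution over the time axis (the locality that lets it model ornamentation-scale detail), and an up-projection added back residually so the frozen backbone path is untouched. All names and shapes here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def conv_adapter(x, w_down, w_conv, w_up):
    """Sketch of a bottleneck convolutional adapter (assumed design).

    x:      (T, d_model) hidden states from a frozen backbone layer
    w_down: (d_model, d_bottleneck) down-projection
    w_conv: (d_bottleneck, k) one kernel per bottleneck channel
    w_up:   (d_bottleneck, d_model) up-projection
    """
    h = x @ w_down                      # project to the bottleneck
    # Depthwise 1-D convolution along time, one kernel per channel:
    # this is the local receptive field that captures short phrases.
    h = np.stack(
        [np.convolve(h[:, c], w_conv[c], mode="same")
         for c in range(h.shape[1])],
        axis=1,
    )
    h = np.maximum(h, 0.0)              # ReLU nonlinearity
    return x + h @ w_up                 # residual keeps the frozen path
```

Initializing `w_up` to zeros makes the adapter an identity function at the start of training, a common trick so adaptation begins from the backbone's unmodified behavior.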
Atharva Mehta
Mohamed bin Zayed University of AI, Abu Dhabi, UAE
Shivam Chauhan
Mohamed bin Zayed University of AI, Abu Dhabi, UAE
Monojit Choudhury
Professor of Natural Language Processing, MBZUAI
Natural Language Processing · Large Language Models · Ethics of AI · Computational Social Science