🤖 AI Summary
This work addresses a limitation of existing music generation models: they are trained predominantly on Western music and therefore struggle to capture the modal systems (Dastgah), timbral qualities, and rhythmic structures of Persian music. To bridge this gap, the authors construct what they describe as the first large-scale, high-quality Persian music dataset, comprising over 900 hours of audio spanning traditional, pop, and contemporary sub-genres, and fine-tune the MusicGen model on it. Evaluations combining subjective and objective metrics, including the proportion of intended style tags reflected in the generated outputs, indicate that the fine-tuned model aligns more closely with Persian stylistic conventions. The study illustrates that a general-purpose music generation model can be adapted to an underrepresented, non-Western musical context.
📝 Abstract
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours of high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance using both subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that align more closely with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.
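The tag-alignment metric mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how tags are detected in the generated audio, so this example assumes some external tagger has already produced a set of detected tags per clip; the function names `tag_alignment` and `mean_tag_alignment`, the case-insensitive exact matching, and the sample tags are all hypothetical and not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def tag_alignment(intended: Iterable[str], detected: Iterable[str]) -> float:
    """Fraction of intended style tags that also appear among the detected tags.

    Matching is case-insensitive exact match (an assumption for this sketch).
    """
    want = {t.strip().lower() for t in intended}
    got = {t.strip().lower() for t in detected}
    if not want:
        raise ValueError("intended tag set is empty")
    return len(want & got) / len(want)


def mean_tag_alignment(
    pairs: Sequence[Tuple[Iterable[str], Iterable[str]]]
) -> float:
    """Average the per-clip alignment over (intended, detected) pairs."""
    if not pairs:
        raise ValueError("no clips to score")
    return sum(tag_alignment(i, d) for i, d in pairs) / len(pairs)


# Hypothetical example: two generated clips with their prompt tags
# and the tags an (assumed) audio tagger detected in the output.
clips = [
    (["pop", "santur", "Dastgah Shur"], ["santur", "pop", "vocal"]),
    (["traditional"], ["traditional", "tar"]),
]
score = mean_tag_alignment(clips)
```

Under these assumptions the score is a recall-style quantity: extra detected tags do not lower it, only intended tags that fail to appear in the output do.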