🤖 AI Summary
Problem: Generalizing across modalities (sheet music, performance signals, audio) and across languages remains challenging in music information retrieval (MIR), especially when modalities are unaligned and languages are unseen during training.
Method: We propose CLaMP 3, a contrastive learning framework that aligns sheet music, performance signals, and audio recordings with multilingual text in a shared representation space. Its multilingual text encoder adapts to unseen languages, and text serves as a semantic bridge that enables retrieval between music modalities that are not directly paired with each other.
Contribution/Results: We introduce M4-RAG, a web-scale dataset of 2.31 million music-text pairs enriched with detailed metadata covering a wide array of global musical traditions, and WikiMT-X, a benchmark of 1,000 triplets of sheet music, audio, and text descriptions. The framework achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines in multimodal and multilingual settings.
📝 Abstract
CLaMP 3 is a unified framework developed to address the challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities (sheet music, performance signals, and audio recordings) with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder that adapts to unseen languages and exhibits strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
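To illustrate the idea of aligning music with text and then using text as a bridge, the following is a minimal sketch, not the CLaMP 3 implementation. It assumes a symmetric InfoNCE objective, a common choice for contrastive learning that the abstract does not confirm, and a temperature of 0.07, which is also an assumption. The function names (`info_nce`, `retrieve`) and the two-dimensional toy embeddings are hypothetical; real embeddings would come from trained encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def info_nce(music_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of music_emb is paired with row i of text_emb.

    The temperature value is an assumption made for illustration.
    """
    sims = cosine_matrix(music_emb, text_emb)
    n = len(sims)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            scaled = [s / temperature for s in row]
            m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
            total += log_z - scaled[i]  # negative log-probability of the true pair
        return total / n

    transposed = [list(col) for col in zip(*sims)]
    # Average the music-to-text and text-to-music directions.
    return 0.5 * (cross_entropy(sims) + cross_entropy(transposed))

def retrieve(query, candidates):
    """Index of the candidate embedding most similar to the query."""
    sims = cosine_matrix([query], candidates)[0]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings (hypothetical): sheet music and audio were each aligned to
# text only, never to each other.
text = [[1.0, 0.0], [0.0, 1.0]]
sheet = [[1.0, 0.1], [0.1, 1.0]]
audio = [[0.9, 0.2], [0.2, 0.9]]

print("aligned loss:   ", info_nce(sheet, text))
print("misaligned loss:", info_nce(sheet, text[::-1]))
print("audio match for sheet 0:", retrieve(sheet[0], audio))
```

Because both music modalities are pulled toward the same text embeddings, a sheet-music query lands near the matching audio even though no sheet-audio pair was ever used, which is the sense in which text acts as a bridge.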