AI Summary
General-purpose music representation models, trained on large-scale heterogeneous corpora, struggle to capture fine-grained semantic distinctions within homogeneous solo piano music, while existing piano-specific models are predominantly unimodal and neglect the intrinsic cross-modal relationships among audio, symbolic (MIDI), and textual representations.
Method: We propose PianoBind, the first multimodal joint embedding framework tailored for popular piano music, built upon a Transformer architecture that fuses audio, MIDI, and text inputs. It employs contrastive learning and cross-modal alignment to enforce semantic consistency, augmented by a modality-cooperative mechanism and a multi-source training strategy that strengthen representation learning under data scarcity.
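For concreteness, the contrastive alignment described above can be read as a CLIP-style InfoNCE objective applied pairwise across the three modalities. The sketch below, in PyTorch, is a minimal illustration under that assumption; the temperature and the equal weighting of the three pairs are placeholders, not PianoBind's actual design.

```python
# Minimal sketch of a trimodal contrastive objective (assumption:
# CLIP-style pairwise InfoNCE, NOT PianoBind's published implementation).
import torch
import torch.nn.functional as F

def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings, each (B, D)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_loss(audio_emb, midi_emb, text_emb):
    """Sum the pairwise losses so all three modalities share one embedding space."""
    return (info_nce(audio_emb, text_emb)
            + info_nce(midi_emb, text_emb)
            + info_nce(audio_emb, midi_emb))
```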
Contribution/Results: Experiments demonstrate substantial improvements over general-purpose music models on both in-domain and out-of-domain piano text-to-music retrieval, validating the framework's effectiveness in modeling nuanced musical expression and its strong generalization. This work establishes a reusable paradigm for domain-specific multimodal music understanding.
Abstract
Solo piano music, despite being a single-instrument medium, possesses significant expressive capability, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, which is expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
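As an illustration of the text-to-music retrieval task used for evaluation: once audio, MIDI, and text share one embedding space, retrieval reduces to cosine-similarity ranking between a text query embedding and precomputed track embeddings. The helper below is a hypothetical sketch; it assumes the embeddings have already been produced by the model's encoders.

```python
# Hypothetical retrieval step over a shared embedding space; the
# embeddings are assumed to be precomputed by the model's encoders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_k(text_emb: torch.Tensor, track_embs: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the indices of the k tracks most similar to one text query.

    text_emb:   (D,) embedding of the query text.
    track_embs: (N, D) embeddings of the candidate piano tracks.
    """
    sims = F.normalize(track_embs, dim=-1) @ F.normalize(text_emb, dim=-1)  # (N,) cosine scores
    return sims.topk(min(k, sims.numel())).indices
```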