PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music

📅 2025-09-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
General-purpose music representation models struggle to capture fine-grained semantic distinctions within homogeneous solo piano data, while existing piano-specific models are predominantly unimodal and neglect the intrinsic cross-modal relationships among audio, symbolic (MIDI), and textual representations. Method: The paper proposes PianoBind, a multimodal joint embedding framework tailored to popular piano music, built on Transformer encoders that embed audio, MIDI, and text in a shared space. It employs contrastive learning to enforce cross-modal semantic consistency, augmented by multi-source training and modality-utilization strategies that strengthen representation learning under data scarcity. Contribution/Results: Experiments demonstrate substantial improvements over general-purpose music joint embedding models on both in-domain and out-of-domain piano text-to-music retrieval, validating the framework’s effectiveness at modeling nuanced musical expression and its generalization capability. The work establishes a reusable paradigm for domain-specific multimodal music understanding.
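As a rough illustration of the contrastive alignment described above, the sketch below pulls matched audio–MIDI–text triples together pairwise with a symmetric InfoNCE loss. The function names, in-batch-negatives setup, and temperature value are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings.

    Row i of `a` and row i of `b` form a positive pair; every other
    in-batch combination serves as a negative.
    """
    logits = a @ b.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_contrastive_loss(audio, midi, text):
    """Average the pairwise losses so all three modalities share one space."""
    audio, midi, text = (F.normalize(e, dim=-1) for e in (audio, midi, text))
    return (info_nce(audio, text) +
            info_nce(midi, text) +
            info_nce(audio, midi)) / 3
```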

📝 Abstract
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
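For concreteness, text-to-music retrieval in such a joint embedding space amounts to ranking precomputed track embeddings by cosine similarity against a text-query embedding, as in this minimal sketch (tensor shapes and the top-k interface are assumptions for illustration, not the paper’s API):

```python
import torch
import torch.nn.functional as F

def text_to_music_retrieval(query_emb: torch.Tensor,
                            track_embs: torch.Tensor,
                            k: int = 10) -> torch.Tensor:
    """Rank tracks by cosine similarity to a single text query.

    query_emb:  (D,)   embedding of the text query
    track_embs: (N, D) precomputed track embeddings (audio, MIDI, or fused)
    Returns indices of the top-k matching tracks, best first.
    """
    sims = F.normalize(track_embs, dim=-1) @ F.normalize(query_emb, dim=0)
    return sims.topk(k).indices
```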
Problem

Research questions and friction points this paper is trying to address.

Capturing subtle semantic distinctions within homogeneous solo piano music
Overcoming the unimodal limitations of existing piano-specific representation models
Optimizing a joint embedding framework for small-scale, homogeneous piano datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

PianoBind, a piano-specific joint embedding model over audio, MIDI, and text
Multi-source training and modality-utilization strategies for fine-grained semantics (see the sketch below)
Superior text-to-music retrieval on in-domain and out-of-domain piano datasets
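One plausible reading of the multi-source training bullet above is balancing several heterogeneous piano datasets within a single sampling stream. The sketch below shows such a scheme with standard PyTorch utilities, as an assumed illustration rather than the paper’s actual strategy.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def multi_source_loader(datasets, batch_size=64):
    """Mix several datasets into one loader, sampling each source with
    equal probability regardless of its size."""
    combined = ConcatDataset(datasets)
    # Weight each example inversely to its source's size so small,
    # richly annotated sets are not drowned out by large ones.
    weights = torch.cat([torch.full((len(d),), 1.0 / len(d)) for d in datasets])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```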