AI Summary
General-purpose music representation models, trained on large-scale heterogeneous corpora, struggle to capture fine-grained semantic distinctions within homogeneous solo piano music, while existing piano-specific models are predominantly unimodal and neglect the intrinsic cross-modal relationships among audio, symbolic (MIDI), and textual representations.
Method: We propose PianoBind, the first multimodal joint embedding framework tailored for popular piano music, built upon a Transformer architecture that fuses audio, MIDI, and text inputs. It employs contrastive learning and cross-modal alignment to enforce semantic consistency, augmented by a modality-cooperative mechanism and a multi-source training strategy that strengthen representation learning under data scarcity.
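For concreteness, the contrastive alignment described above can be read as a CLIP-style InfoNCE objective applied pairwise across the three modalities. The sketch below, in PyTorch, is a minimal illustration under that assumption; the temperature and the equal weighting of the three pairs are placeholders, not PianoBind's actual design.

```python
# Minimal sketch of a trimodal contrastive objective (assumption:
# CLIP-style pairwise InfoNCE, NOT PianoBind's published implementation).
import torch
import torch.nn.functional as F

def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings, each (B, D)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_loss(audio_emb, midi_emb, text_emb):
    """Sum the pairwise losses so all three modalities share one embedding space."""
    return (info_nce(audio_emb, text_emb)
            + info_nce(midi_emb, text_emb)
            + info_nce(audio_emb, midi_emb))
```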
Contribution/Results: Experiments demonstrate substantial improvements over general-purpose music models on both in-domain and out-of-domain piano text-to-music retrieval, validating the framework's effectiveness in modeling nuanced musical expression and its strong generalization. This work establishes a reusable paradigm for domain-specific multimodal music understanding.
Abstract
Solo piano music, despite being a single-instrument medium, possesses significant expressive capability, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, which is expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
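As an illustration of the text-to-music retrieval task used for evaluation: once audio, MIDI, and text share one embedding space, retrieval reduces to cosine-similarity ranking between a text query embedding and precomputed track embeddings. The helper below is a hypothetical sketch; it assumes the embeddings have already been produced by the model's encoders.

```python
# Hypothetical retrieval step over a shared embedding space; the
# embeddings are assumed to be precomputed by the model's encoders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_k(text_emb: torch.Tensor, track_embs: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the indices of the k tracks most similar to one text query.

    text_emb:   (D,) embedding of the query text.
    track_embs: (N, D) embeddings of the candidate piano tracks.
    """
    sims = F.normalize(track_embs, dim=-1) @ F.normalize(text_emb, dim=-1)  # (N,) cosine scores
    return sims.topk(min(k, sims.numel())).indices
```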