PianoVAM: A Multimodal Piano Performance Dataset

📅 2025-09-10
🤖 AI Summary
Current music information retrieval (MIR) research lacks a comprehensive multimodal piano dataset captured in authentic practice settings that concurrently records video, audio, MIDI, hand keypoint trajectories, fingerings, and rich metadata. To address this gap, we introduce PianoVAM, the first large-scale, in-the-wild piano performance dataset to capture all six modalities synchronously. Our methodology integrates a Disklavier piano for high-fidelity MIDI and audio acquisition, overhead RGB video, pretrained hand pose estimation models, and cross-modal temporal alignment techniques. We further propose a semi-automatic fingering annotation pipeline that improves both annotation accuracy and efficiency. The dataset is publicly released together with a benchmark for audio-only and audio-visual piano transcription. Experimental results demonstrate its utility for multimodal music understanding and performance analysis tasks.
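The summary mentions cross-modal temporal alignment but does not describe the paper's actual technique. One common approach for aligning an audio stream with MIDI is to quantize note-onset events from both streams and pick the global lag that maximizes their overlap. A minimal sketch under that assumption (the function name, frame rate, and lag search range are illustrative, not from the paper):

```python
def estimate_offset(audio_onsets, midi_onsets, fps=100, max_lag=500):
    """Estimate the global time offset (in seconds) between two onset
    lists by maximizing the number of coinciding events over candidate lags."""
    # Quantize onset times (seconds) to integer frame indices.
    a = {round(t * fps) for t in audio_onsets}
    m = {round(t * fps) for t in midi_onsets}
    # Score each candidate lag by how many MIDI onsets, once shifted,
    # land on an audio onset frame; keep the best-scoring lag.
    best_lag = max(range(-max_lag, max_lag + 1),
                   key=lambda lag: len(a & {f + lag for f in m}))
    return best_lag / fps  # shift MIDI times by this amount to align
```

For example, `estimate_offset([1.0, 2.0, 3.5], [0.9, 1.9, 3.4])` returns `0.1`, meaning the MIDI stream lags the audio by 100 ms. Real pipelines would first extract audio onsets with an onset detector and refine the alignment locally, but the core idea is the same.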

📝 Abstract
The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
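The abstract states that hand landmarks were extracted with a pretrained hand pose estimation model. If that model follows the widely used MediaPipe 21-point hand convention (an assumption; the abstract does not name the model), the five fingertips sit at fixed landmark indices, so recovering per-finger positions from a detected hand reduces to a lookup:

```python
# In MediaPipe's 21-landmark hand model, the fingertips are landmarks
# 4 (thumb), 8 (index), 12 (middle), 16 (ring), and 20 (pinky).
FINGERTIP_IDS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}  # finger number -> landmark index

def fingertip_positions(landmarks):
    """Map a 21-point landmark list to {finger_number: (x, y)} pairs."""
    return {finger: landmarks[idx] for finger, idx in FINGERTIP_IDS.items()}
```

Per-frame fingertip trajectories extracted this way are the natural input to the fingering annotation step described later.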
Problem

Research questions and friction points this paper is trying to address.

Collect multimodal piano performance data
Synchronize and align video, audio, and MIDI modalities
Develop automated fingering annotation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset with video, audio, and MIDI
Hand landmarks extracted using pose estimation
Semi-automated fingering annotation algorithm
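The semi-automated fingering annotation pairs video-derived fingertip positions with MIDI note onsets. A plausible core rule (a sketch, not the paper's exact algorithm) assigns each pressed key to the finger whose tip is closest to the key's centre at onset time:

```python
def assign_finger(key_center_x, fingertips):
    """Return the finger number whose tip is horizontally closest
    to the pressed key's centre (pixel coordinates assumed)."""
    return min(fingertips, key=lambda f: abs(fingertips[f][0] - key_center_x))
```

For example, with fingertips `{1: (120, 40), 2: (200, 35), 3: (260, 33)}` and a key centred at `x = 205`, finger 2 is assigned. In a semi-automated pipeline, such automatic proposals would then be reviewed and corrected by human annotators, which matches the accuracy-and-efficiency claim in the abstract.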