PianoVAM: A Multimodal Piano Performance Dataset

📅 2025-09-10
🤖 AI Summary
Current music information retrieval (MIR) research lacks a comprehensive multimodal piano dataset captured in authentic practice settings that concurrently records video, audio, MIDI, hand keypoint trajectories, fingerings, and rich metadata. To address this gap, we introduce PianoVAM, the first large-scale, in-the-wild piano performance dataset to capture all six modalities synchronously. Our methodology integrates a Disklavier piano for high-fidelity MIDI and audio acquisition, overhead RGB video, pretrained hand pose estimation models, and cross-modal temporal alignment techniques. We further propose a semi-automatic fingering annotation pipeline that improves both annotation accuracy and efficiency. The dataset is publicly released together with a benchmark for audio-only and audio-visual piano transcription. Experimental results demonstrate its utility for multimodal music understanding and performance analysis tasks.
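The summary mentions cross-modal temporal alignment but does not describe the paper's actual technique. One common approach for aligning an audio stream with MIDI is to quantize note-onset events from both streams and pick the global lag that maximizes their overlap. A minimal sketch under that assumption (the function name, frame rate, and lag search range are illustrative, not from the paper):

```python
def estimate_offset(audio_onsets, midi_onsets, fps=100, max_lag=500):
    """Estimate the global time offset (in seconds) between two onset
    lists by maximizing the number of coinciding events over candidate lags."""
    # Quantize onset times (seconds) to integer frame indices.
    a = {round(t * fps) for t in audio_onsets}
    m = {round(t * fps) for t in midi_onsets}
    # Score each candidate lag by how many MIDI onsets, once shifted,
    # land on an audio onset frame; keep the best-scoring lag.
    best_lag = max(range(-max_lag, max_lag + 1),
                   key=lambda lag: len(a & {f + lag for f in m}))
    return best_lag / fps  # shift MIDI times by this amount to align
```

For example, `estimate_offset([1.0, 2.0, 3.5], [0.9, 1.9, 3.4])` returns `0.1`, meaning the MIDI stream lags the audio by 100 ms. Real pipelines would first extract audio onsets with an onset detector and refine the alignment locally, but the core idea is the same.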

📝 Abstract
The multimodal nature of music performance has driven increasing interest in data beyond the audio domain within the music information retrieval (MIR) community. This paper introduces PianoVAM, a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions, alongside synchronized top-view videos in realistic and varied performance conditions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm. We discuss the challenges encountered during data collection and the alignment process across different modalities. Additionally, we describe our fingering annotation method based on hand landmarks extracted from videos. Finally, we present benchmarking results for both audio-only and audio-visual piano transcription using the PianoVAM dataset and discuss additional potential applications.
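The abstract states that hand landmarks were extracted with a pretrained hand pose estimation model. If that model follows the widely used MediaPipe 21-point hand convention (an assumption; the abstract does not name the model), the five fingertips sit at fixed landmark indices, so recovering per-finger positions from a detected hand reduces to a lookup:

```python
# In MediaPipe's 21-landmark hand model, the fingertips are landmarks
# 4 (thumb), 8 (index), 12 (middle), 16 (ring), and 20 (pinky).
FINGERTIP_IDS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}  # finger number -> landmark index

def fingertip_positions(landmarks):
    """Map a 21-point landmark list to {finger_number: (x, y)} pairs."""
    return {finger: landmarks[idx] for finger, idx in FINGERTIP_IDS.items()}
```

Per-frame fingertip trajectories extracted this way are the natural input to the fingering annotation step described later.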
Problem

Research questions and friction points this paper is trying to address.

Collect multimodal piano performance data
Synchronize and align video, audio, and MIDI modalities
Develop automated fingering annotation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset with video, audio, and MIDI
Hand landmarks extracted using pose estimation
Semi-automated fingering annotation algorithm
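The semi-automated fingering annotation pairs video-derived fingertip positions with MIDI note onsets. A plausible core rule (a sketch, not the paper's exact algorithm) assigns each pressed key to the finger whose tip is closest to the key's centre at onset time:

```python
def assign_finger(key_center_x, fingertips):
    """Return the finger number whose tip is horizontally closest
    to the pressed key's centre (pixel coordinates assumed)."""
    return min(fingertips, key=lambda f: abs(fingertips[f][0] - key_center_x))
```

For example, with fingertips `{1: (120, 40), 2: (200, 35), 3: (260, 33)}` and a key centred at `x = 205`, finger 2 is assigned. In a semi-automated pipeline, such automatic proposals would then be reviewed and corrected by human annotators, which matches the accuracy-and-efficiency claim in the abstract.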