AI Summary
Assessing the difficulty of diverse piano audio performances remains challenging in music education due to the absence of symbolic-level transcriptions. Method: This paper introduces the first purely audio-driven piano performance difficulty assessment framework. We construct PSyllabus, a large-scale benchmark comprising 7,901 pieces annotated across 11 difficulty levels, filling a critical gap in Music Information Retrieval (MIR) for audio-only difficulty modeling. Our unified recognition architecture flexibly integrates unimodal or multimodal representations (including MFCCs, log-Mel spectrograms, and statistical features of rhythm and pitch) extracted via OpenL3 or PANNs, and employs CNN or Transformer backbones for multi-task joint training. Results: Experiments demonstrate that raw audio contains substantial discriminative information for difficulty estimation; the multimodal approach achieves an average accuracy gain of 9.2% over unimodal baselines. All data, code, and models are publicly released, establishing the first standardized resource for this task.
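The multimodal integration described above can be sketched as an early-fusion pipeline: per-piece embeddings from different modalities are concatenated and fed to a classification head over the 11 difficulty levels. The following is a minimal, illustrative NumPy sketch; the embedding dimensions, random inputs, and untrained linear head are assumptions for demonstration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-piece embeddings (dimensions are assumptions):
# a 512-d audio embedding (e.g., as produced by OpenL3/PANNs-style
# extractors) and an 8-d rhythm/pitch statistics vector.
audio_emb = rng.standard_normal(512)
stats_emb = rng.standard_normal(8)

# Multimodal early fusion: concatenate the modality vectors.
fused = np.concatenate([audio_emb, stats_emb])  # shape (520,)

# Minimal linear head over the 11 difficulty levels (untrained weights,
# purely illustrative of the classification step).
W = rng.standard_normal((11, fused.size)) * 0.01
b = np.zeros(11)
logits = W @ fused + b

# Softmax over difficulty levels; argmax gives the predicted level (1-indexed).
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred_level = int(np.argmax(probs)) + 1
```

In practice, a CNN or Transformer backbone would replace the linear head, and the fusion point (early vs. late) is a design choice the framework leaves configurable.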
Abstract
Automatically estimating the performance difficulty of a music piece represents a key process in music education to create tailored curricula according to the individual needs of the students. Given its relevance, the Music Information Retrieval (MIR) field comprises some proof-of-concept works addressing this task that mainly focus on high-level music abstractions such as machine-readable scores or music sheet images. In this regard, the potential of directly analyzing audio recordings has generally been neglected. This work addresses this gap in the field with two contributions: (i) PSyllabus, the first audio-based difficulty estimation dataset, collected from the Piano Syllabus community and featuring 7,901 piano pieces across 11 difficulty levels from 1,233 composers, as well as two additional benchmark datasets particularly compiled for evaluation purposes; and (ii) a recognition framework capable of managing different input representations, both in unimodal and multimodal manners, derived from audio to perform the difficulty estimation task. The comprehensive experimentation comprising different pre-training schemes, input modalities, and multi-task scenarios proves the validity of the hypothesis and establishes PSyllabus as a reference dataset for audio-based difficulty estimation in the MIR field. The dataset, developed code, and trained models are publicly shared to promote further research in the field.