Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network

📅 2025-10-20
🤖 AI Summary
Estimating dynamic levels in piano performance remains a fundamental challenge in computational music analysis. This paper introduces a novel multi-task, multi-scale neural network that achieves, for the first time, end-to-end joint modeling of four interrelated musical attributes directly from audio: dynamic level (e.g., *p*, *f*), dynamic change points, beat positions, and downbeats. Leveraging Bark-scale loudness features and a shared latent representation architecture, the model significantly reduces parameter count while preserving temporal modeling capacity; it supports 60-second audio sequences, balancing expressiveness and computational efficiency. Evaluated on the MazurkaBL dataset, our approach achieves state-of-the-art performance across all four tasks, compressing model parameters from 14.7M to just 0.5M. This work establishes the first compact, efficient, and structurally coherent benchmark model for piano expressivity analysis.

📝 Abstract
Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation; together, these four targets describe how dynamics align with the metrical structure of the score. Inspired by recent work on vocal dynamics, we adopt a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to a log-Mel input, this reduces the model size from 14.7 M to 0.5 M parameters and enables long sequential input: we segment audio into 60-second windows, double the length commonly used in beat tracking. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamics estimation and delivers a powerful, compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.
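The Bark-scale loudness front end described in the abstract can be approximated as follows. This is a minimal sketch, not the paper's implementation: it pools an STFT power spectrum into 24 equal-width Bark bands (using the Zwicker/Traunmüller frequency-to-Bark approximation) and applies a Stevens power-law compression as a crude stand-in for specific loudness; the function name, band count, and STFT parameters are illustrative assumptions.

```python
import numpy as np

def hz_to_bark(f):
    # Zwicker/Traunmueller approximation of the Bark scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_loudness(audio, sr=22050, n_fft=1024, hop=512, n_bands=24):
    """Frame the signal, take an STFT power spectrum, and pool bins into Bark bands.

    Returns an array of shape (n_frames, n_bands); the power-law exponent 0.23
    is a rough loudness compression, not Zwicker's full specific-loudness model.
    """
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bark = hz_to_bark(freqs)
    # assign each FFT bin to one of n_bands equal-width Bark bands
    edges = np.linspace(bark.min(), bark.max(), n_bands + 1)
    band_idx = np.clip(np.digitize(bark, edges) - 1, 0, n_bands - 1)
    feats = np.zeros((n_frames, n_bands))
    for t in range(n_frames):
        frame = audio[t * hop : t * hop + n_fft] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        for b in range(n_bands):
            feats[t, b] = power[band_idx == b].sum()
    return feats ** 0.23  # Stevens power-law compression
```

With 24 Bark bands instead of the 80+ bins typical of log-Mel front ends, the per-frame feature is far smaller, which is consistent with the paper's motivation for fitting 60-second inputs into a compact model.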
Problem

Research questions and friction points this paper is trying to address.

Joint estimation of piano dynamics and metrical structure from audio
Developing a compact multi-task network for dynamic levels and beat tracking
Enabling long-sequence analysis with efficient Bark-scale loudness features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task network predicts dynamics and metrical structure
Multi-scale backbone uses Bark-scale loudness input
Compact model enables 60-second audio analysis
Zhanhong He
PhD Student, University of Western Australia
Automatic Music Transcription · Audio Processing
Hanyu Meng
The University of New South Wales, Sydney, Australia
Defeng (David) Huang
The University of Western Australia, Perth, Australia
Roberto Togneri
The University of Western Australia
Speech Processing · Image Processing · Pattern Recognition