🤖 AI Summary
Traditional frame-level metrics fail to capture musically salient features of piano sustain pedal depth estimation, such as direction-change boundaries and contour morphology, so the resulting evaluations lack semantic interpretability. To address this, we propose the first multi-granularity evaluation framework, integrating action-level assessment (direction-switch point detection) with gesture-level assessment (contour alignment and shape similarity) to overcome the limitations of frame-wise accuracy alone. Methodologically, we introduce segmented state detection coupled with dynamic time warping (DTW)-guided contour alignment, enabling unified comparison across audio-only baselines, MIDI-augmented models, and their binarized variants. Experiments demonstrate that the MIDI-augmented model achieves significant gains at both the action and gesture levels while improving only marginally in frame-level accuracy, validating that our framework is semantically sensitive, highly interpretable, and able to reveal substantive performance improvements that conventional metrics overlook.
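To make the action-level idea concrete, the sketch below illustrates one plausible reading of segmented state detection and direction-switch matching in Python. The dead-band threshold `eps`, the matching tolerance `tol`, and the greedy one-to-one matching scheme are illustrative assumptions, not the paper's actual parameters or procedure.

```python
import numpy as np

def segment_states(depth, eps=0.01):
    """Label each frame as press (+1), hold (0), or release (-1)
    from the sign of the first difference of the depth curve.
    `eps` is a hypothetical dead-band threshold."""
    depth = np.asarray(depth, dtype=float)
    d = np.diff(depth, prepend=depth[0])
    states = np.zeros(len(d), dtype=int)
    states[d > eps] = 1      # depth increasing -> press
    states[d < -eps] = -1    # depth decreasing -> release
    return states

def switch_points(states):
    """Frame indices where the press/hold/release state changes,
    i.e. candidate direction-switch boundaries."""
    return np.flatnonzero(np.diff(states) != 0) + 1

def action_f1(ref_depth, est_depth, tol=5, eps=0.01):
    """Greedily match estimated switch points to reference ones
    within `tol` frames and report precision/recall/F1."""
    ref = switch_points(segment_states(ref_depth, eps))
    est = switch_points(segment_states(est_depth, eps))
    matched, tp = set(), 0
    for e in est:
        hits = [r for r in ref if abs(r - e) <= tol and r not in matched]
        if hits:
            matched.add(min(hits, key=lambda r: abs(r - e)))
            tp += 1
    prec = tp / max(len(est), 1)
    rec = tp / max(len(ref), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)
    return prec, rec, f1
```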
📝 Abstract
Evaluation of continuous piano pedal depth estimation remains incomplete when it relies only on conventional frame-level metrics, which overlook musically important features such as direction-change boundaries and pedal curve contours. To provide more interpretable and musically meaningful insights, we propose an evaluation framework that augments standard frame-level metrics with an action-level assessment, which measures direction and timing over segments of press/hold/release states, and a gesture-level analysis, which evaluates the contour similarity of each press-release cycle. We apply this framework to compare an audio-only baseline with two variants, one incorporating symbolic information from MIDI and another trained in a binary-valued setting, all within a unified architecture. Results show that the MIDI-informed model significantly outperforms the others at the action and gesture levels despite only modest frame-level gains. These findings demonstrate that our framework captures musically relevant improvements that traditional metrics cannot discern, offering a more practical and effective approach to evaluating pedal depth estimation models.
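The gesture-level analysis compares the contour of each press-release cycle after alignment; below is a minimal DTW sketch of such a comparison. The length normalization and the mapping of DTW cost to a (0, 1] similarity score are illustrative choices, not the paper's definitions.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-D contours,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def gesture_similarity(ref_cycle, est_cycle):
    """Length-normalized DTW cost mapped to a (0, 1] similarity,
    where 1.0 means identical contours."""
    cost = dtw_distance(np.asarray(ref_cycle, dtype=float),
                        np.asarray(est_cycle, dtype=float))
    norm = cost / (len(ref_cycle) + len(est_cycle))
    return 1.0 / (1.0 + norm)

# Usage: compare one reference press-release cycle to an estimate.
ref = [0.0, 0.3, 0.7, 1.0, 0.8, 0.4, 0.0]
est = [0.0, 0.2, 0.6, 0.9, 0.9, 0.5, 0.1, 0.0]
print(gesture_similarity(ref, est))
```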