🤖 AI Summary
This work addresses three challenges in audio-driven piano motion generation: inaccurate modeling of musical structure, rigid hand-coordination mechanisms, and the difficulty of generating long sequences in real time. It introduces PianoFlow, a framework built on a flow-matching generative architecture. PianoFlow distills MIDI priors during training while keeping inference audio-only, reportedly the first approach to decouple the two. It uses asymmetric role-gated attention to explicitly model dynamic bimanual coordination, and an autoregressive flow continuation mechanism to stream sequences of arbitrary length efficiently. On the PianoMotion10M dataset, PianoFlow outperforms existing methods and speeds up inference by more than 9× while preserving the semantic fidelity and temporal coherence of the generated motions.
📝 Abstract
Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9× compared to previous methods.
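The abstract does not spell out how chunked flow-matching generation works, so the following is only a minimal sketch of the general idea, not PianoFlow's actual implementation. It assumes a straight-line probability path, a fixed-step Euler sampler, and an inpainting-style continuation in which the first few frames of each chunk are pinned to the tail of the previous chunk. The names `toy_velocity`, `sample_chunk`, and `stream_motion` are hypothetical, and the velocity function is a closed-form stand-in for a trained network.

```python
import numpy as np


def toy_velocity(x, t, target):
    # Hypothetical stand-in for a learned velocity network v(x, t | audio).
    # For the straight path x_t = (1 - t) * x0 + t * x1 this is the exact
    # velocity that transports x toward `target` by t = 1.
    return (target - x) / (1.0 - t)


def sample_chunk(audio_chunk, prefix, steps, rng):
    """Euler-integrate the flow ODE from noise (t=0) to motion (t=1)."""
    x = rng.standard_normal(audio_chunk.shape)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # largest t is 1 - dt, so (1 - t) never reaches zero
        if prefix is not None:
            # Assumed continuation rule: keep overlap frames fixed to the
            # already-generated tail of the previous chunk.
            x[: len(prefix)] = prefix
        x = x + dt * toy_velocity(x, t, audio_chunk)
    if prefix is not None:
        x[: len(prefix)] = prefix
    return x


def stream_motion(audio, chunk=8, overlap=2, steps=4, seed=0):
    """Generate motion chunk by chunk; each chunk sees the previous tail."""
    assert 1 <= overlap < chunk
    rng = np.random.default_rng(seed)
    pieces, prefix, start = [], None, 0
    while start < len(audio):
        lo = start if prefix is None else start - overlap
        motion = sample_chunk(audio[lo : start + chunk], prefix, steps, rng)
        # Emit only the frames that were not already produced.
        pieces.append(motion if prefix is None else motion[overlap:])
        prefix = motion[-overlap:]
        start += chunk
    return np.concatenate(pieces, axis=0)


# Toy "audio features" with the same shape as the motion (an assumption
# made only so the stand-in velocity has a well-defined target).
audio = np.linspace(0.0, 1.0, 60).reshape(20, 3)
motion = stream_motion(audio)
```

In this toy setup the velocity field is exact, so a handful of Euler steps already lands on the target; with a learned network, the step count trades quality against latency. Pinning the overlap frames is one common way to keep chunk boundaries continuous, and the paper's continuation scheme may differ.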