🤖 AI Summary
Existing music generation models rely on asynchronous text prompts, a workflow that cannot deliver the real-time responsiveness and embodied interaction that improvisational performance demands, which limits their applicability in professional musical practice. To address this, we present Aria-Duet, a real-time human–AI co-creation system built upon a Yamaha Disklavier piano, wherein a human performer and the generative model Aria alternate as “dialogic agents” in improvisational dialogue. The system integrates low-latency audio triggering, real-time MIDI transcription, and style-aware phrase generation to achieve millisecond-scale response times and semantically coherent turn-taking. A musicological evaluation of the system's output shows that the AI maintains musical coherence, stylistic adaptability, and expressive nuance when generating performances directly on an acoustic piano. This work establishes an embodied, synchronous, and instrument-native paradigm for human–machine musical co-creation.
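As a rough illustration of the turn-taking protocol described above, the sketch below shows one way the capture-and-handover loop could be structured. It is a minimal sketch only: the MIDI port name `Disklavier`, the silence-based handover cue, and the `generate_continuation` call standing in for Aria are all assumptions, since neither the summary nor the abstract specifies these details.

```python
import time
import mido  # assumed MIDI backend; the paper does not name its MIDI stack

# Hypothetical handover cue: treat a sustained pause as "your turn".
SILENCE_HANDOVER_S = 1.5

def capture_human_turn(inport):
    """Record incoming note events until the performer falls silent."""
    phrase, last_event = [], time.monotonic()
    while True:
        msg = inport.poll()  # non-blocking read keeps polling latency low
        now = time.monotonic()
        if msg is not None and msg.type in ("note_on", "note_off"):
            phrase.append((now, msg))
            last_event = now
        elif phrase and now - last_event > SILENCE_HANDOVER_S:
            return phrase  # hand control over to the model
        else:
            time.sleep(0.001)

def play_model_turn(outport, events):
    """Perform (onset, Message) pairs acoustically via the Disklavier."""
    start = time.monotonic()
    for onset, msg in events:
        delay = onset - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)  # sleep only as long as needed per event
        outport.send(msg)

def duet_loop(generate_continuation, port_name="Disklavier"):
    """Alternate human and model turns indefinitely."""
    with mido.open_input(port_name) as inport, \
            mido.open_output(port_name) as outport:
        while True:
            prompt = capture_human_turn(inport)
            continuation = generate_continuation(prompt)  # stand-in for Aria
            play_model_turn(outport, continuation)
```

A silence threshold is only one plausible handover signal; a pedal gesture or a dedicated control message would slot into `capture_human_turn` in the same place.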
📝 Abstract
While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To address this, we introduce Aria-Duet, an interactive system facilitating a real-time musical duet between a human pianist and Aria, a state-of-the-art generative model, using a Yamaha Disklavier as a shared physical interface. The framework enables a turn-taking collaboration: the user performs, signals a handover, and the model generates a coherent continuation performed acoustically on the piano. Beyond describing the technical architecture enabling this low-latency interaction, we analyze the system's output from a musicological perspective, finding that the model can maintain stylistic semantics and develop coherent phrasal ideas. These results demonstrate that such embodied systems can engage in musically sophisticated dialogue, opening a promising new path for human-AI co-creation.
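To make the "coherent continuation" step concrete, here is a similarly hedged sketch of how a captured phrase might be serialized for the model and how generated events could be converted back for playback. The event-dictionary format is an illustrative assumption; the abstract does not describe Aria's actual input representation.

```python
import mido

def phrase_to_events(phrase):
    """Serialize captured (timestamp, Message) pairs into a model-ready list.

    The dict format here is illustrative only; Aria's real input
    representation is not described in the abstract.
    """
    t0 = phrase[0][0]
    return [
        {"pitch": m.note, "velocity": m.velocity,
         "onset": round(t - t0, 3), "on": m.type == "note_on"}
        for t, m in phrase
    ]

def events_to_messages(events):
    """Convert generated events back into (onset, Message) pairs for playback."""
    return [
        (ev["onset"],
         mido.Message("note_on" if ev["on"] else "note_off",
                      note=ev["pitch"], velocity=ev["velocity"]))
        for ev in events
    ]
```

Under these assumptions, the two helpers would sit on either side of the model call in the earlier loop: the captured phrase is serialized before generation, and the generated events are converted back into timed messages before being performed on the piano.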