🤖 AI Summary
This work addresses the challenge of shaping the musical relationship between a live input and a generated output in real-time music interaction. It proposes training a musical agent on a curated database of separated tracks. The architecture integrates a symbolic decision module that learns musical relationships from the corpus. In the offline implementation described, a Transformer serves as the decision module, Wav2Vec 2.0 serves as the perception module for audio representation extraction, and concatenative synthesis serves as the audio renderer. The central idea is to treat the relationship between paired tracks (A, B) as a learnable symbolic decision process. A quantitative evaluation indicates that the decision module reproduces the relationships learned during training and predicts a coherent track B when conditioned on its guide track A. The work is presented as the first step of a research project on musical agents.
📝 Abstract
This paper presents the first step in a research project situated within the field of musical agents. The objective is to tune, through training, the desired musical relationship between a live musical input and a real-time generated musical output by curating a database of separated tracks. We propose an architecture integrating a symbolic decision module capable of learning and exploiting musical relationships from such a musical corpus. We detail an offline implementation of this architecture employing Transformers as the decision module, associated with a perception module based on Wav2Vec 2.0 and with concatenative synthesis as the audio renderer. We present a quantitative evaluation of the decision module's ability to reproduce relationships learned during training. We demonstrate that our decision module can predict a coherent track B when conditioned on its corresponding "guide" track A, based on a corpus of paired tracks (A, B).
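The three-stage flow described in the abstract (perception, symbolic decision, audio rendering) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: a nearest-centroid lookup stands in for the Wav2Vec 2.0 perception module, a co-occurrence table stands in for the Transformer decision module, and the function names, centroids, and segment bank are hypothetical.

```python
from collections import Counter, defaultdict


def nearest_label(feature, centroids):
    """Perception stand-in (assumption): index of the nearest centroid."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)), key=lambda k: sq_dist(feature, centroids[k]))


def fit_decision(pairs):
    """Decision stand-in (assumption): most frequent B label for each A label.

    `pairs` is a list of (labels_a, labels_b) sequences of equal length,
    mirroring a corpus of paired tracks (A, B).
    """
    counts = defaultdict(Counter)
    for labels_a, labels_b in pairs:
        for a, b in zip(labels_a, labels_b):
            counts[a][b] += 1
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}


def render(labels_b, segment_bank):
    """Renderer stand-in: concatenate one stored segment per predicted label."""
    out = []
    for label in labels_b:
        out.extend(segment_bank[label])
    return out


# Hypothetical toy data: two symbol classes, two paired label sequences,
# and one short "audio" segment per class.
centroids = [(0.0, 0.0), (1.0, 1.0)]
corpus_pairs = [([0, 1, 0], [1, 0, 1]), ([1, 1, 0], [0, 0, 1])]
segment_bank = {0: [0.1, 0.2], 1: [0.9, 0.8]}

mapping = fit_decision(corpus_pairs)

# Guide track A arrives as feature vectors; predict and render track B.
guide_features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.0)]
labels_a = [nearest_label(f, centroids) for f in guide_features]
labels_b = [mapping[a] for a in labels_a]
audio_b = render(labels_b, segment_bank)
```

The sketch only shows how the stages hand data to one another: features become symbols, symbols for A are mapped to symbols for B, and symbols for B select corpus segments to concatenate. A sequence model such as a Transformer would replace the per-symbol lookup with a prediction that depends on context.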