Learning Relationships Between Separate Audio Tracks for Creative Applications

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of modeling musical relationships between live input and generated output in real-time music interaction. We propose a framework for training music agents on a database of separated tracks. Methodologically, we introduce the first end-to-end architecture integrating a symbolic decision-making module: a Transformer models and predicts musical relationships symbolically; Wav2Vec 2.0 serves as the perception module for audio representation extraction; and concatenative synthesis provides high-fidelity audio rendering. The key contribution is the explicit encoding of pairwise track relationships (e.g., A→B) as a learnable symbolic decision process, a novel formulation in generative music systems. Experiments demonstrate that the model accurately reproduces the musical relationships observed in training data and, under real-time guidance, generates semantically coherent and stylistically consistent response tracks, significantly enhancing controllability and expressiveness in creative music applications.

📝 Abstract
This paper presents the first step in a research project situated within the field of musical agents. The objective is to achieve, through training, the tuning of the desired musical relationship between a live musical input and a real-time generated musical output, through the curation of a database of separated tracks. We propose an architecture integrating a symbolic decision module capable of learning and exploiting musical relationships from such a musical corpus. We detail an offline implementation of this architecture employing Transformers as the decision module, associated with a perception module based on Wav2Vec 2.0, and concatenative synthesis as the audio renderer. We present a quantitative evaluation of the decision module's ability to reproduce learned relationships extracted during training. We demonstrate that our decision module can predict a coherent track B when conditioned on its corresponding "guide" track A, based on a corpus of paired tracks (A, B).
Problem

Research questions and friction points this paper is trying to address.

Learning musical relationships between paired audio tracks for creative applications
Real-time generation of coherent musical output guided by live input
Training symbolic decision modules to predict musical relationships from separated tracks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformers learn musical relationships from separated tracks
Wav2Vec 2.0 enables real-time audio perception module
Concatenative synthesis generates conditioned musical output
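The three innovations above compose into a single pipeline: perception discretizes the live guide track into symbols, the decision module maps guide symbols to response symbols, and concatenative synthesis renders the predicted symbols as audio segments. A minimal sketch of how the stages compose, with each stage replaced by a toy stand-in (the paper's actual modules are Wav2Vec 2.0, a trained Transformer, and a concatenative synthesizer; all function names and the symbol mapping here are hypothetical):

```python
from typing import List

def perceive(audio_frames: List[float]) -> List[int]:
    """Perception stub: map continuous audio frames to discrete symbols.
    (In the paper, Wav2Vec 2.0 embeddings fill this role.)"""
    return [int(f * 4) % 8 for f in audio_frames]

def decide(guide_symbols: List[int]) -> List[int]:
    """Decision stub: predict response-track symbols from guide-track symbols.
    (In the paper, a Transformer trained on paired tracks (A, B); here a
    fixed mapping stands in for the learned A -> B relationship.)"""
    return [(s + 3) % 8 for s in guide_symbols]

def render(symbols: List[int], corpus: dict) -> List[str]:
    """Concatenative-synthesis stub: each predicted symbol selects a
    pre-recorded audio segment from the separated-track corpus."""
    return [corpus[s] for s in symbols]

# Toy corpus: symbol -> audio-segment identifier.
corpus = {s: f"segment_{s}.wav" for s in range(8)}

track_a = [0.1, 0.5, 0.9, 0.3]        # live "guide" input A
symbols_a = perceive(track_a)          # perception module
symbols_b = decide(symbols_a)          # symbolic decision (A -> B)
track_b = render(symbols_b, corpus)    # audio rendering
print(track_b)
```

The point of the sketch is the interface between stages: the decision module operates purely on symbols, which is what makes the A→B relationship learnable and inspectable independently of the audio representation.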
Balthazar Bujard
ISMM Team, STMS Lab Ircam - CNRS - Sorbonne Université
Jérôme Nika
ISMM Team, STMS Lab Ircam - CNRS - Sorbonne Université
Frédéric Bevilacqua
ISMM Team, STMS Lab Ircam - CNRS - Sorbonne Université
Nicolas Obin
Associate professor -- Ircam, Sorbonne Université
Speech Synthesis · Voice Conversion · Generative Audio · Creative Generation