🤖 AI Summary
Existing general-purpose audio representations are of limited use in intelligent music production because they model audio effects (FX) only coarsely, and in particular lack instrument-level FX understanding, which hinders fine-grained tasks such as automatic mixing. To address this, we propose Fx-Encoder++, the first model capable of directly extracting instrument-level FX representations from mixed audio signals. Its core innovations are an instrument-query-based embedding extraction mechanism and a multimodal (audio/text) contrastive learning framework, which together enable a disentangled mapping from mixture-level FX embeddings to instrument-specific FX representations. Experiments demonstrate that Fx-Encoder++ significantly outperforms baseline methods on cross-instrument FX retrieval and FX parameter matching tasks. It bridges a critical gap in fine-grained FX modeling for intelligent music production, advancing the capability to reason about effect processing at the individual-instrument level within complex mixes.
📝 Abstract
General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have emphasized audio effects analysis at the mixture level, this focus falls short for tasks demanding instrument-wise audio effects understanding, such as automatic mixing. In this work, we present Fx-Encoder++, a novel model designed to extract instrument-wise audio effects representations from music mixtures. Our approach leverages a contrastive learning framework and introduces an "extractor" mechanism that, when provided with instrument queries (audio or text), transforms mixture-level audio effects embeddings into instrument-wise audio effects embeddings. We evaluated our model on retrieval and audio effects parameter matching tasks across a diverse range of instruments. The results demonstrate that Fx-Encoder++ outperforms previous approaches at the mixture level and shows a novel ability to extract instrument-wise effects representations, addressing a critical capability gap in intelligent music production systems.
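The two ingredients named in the abstract, a query-driven "extractor" and a contrastive objective, can be sketched in miniature. This is an illustrative assumption only, not the paper's actual architecture: the function names, the attention-pooling form of the extractor, and the InfoNCE-style loss are all stand-ins for whatever Fx-Encoder++ really uses.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def extract(query, mix_tokens):
    """Hypothetical extractor: an instrument query vector attends over
    mixture-level FX tokens and pools them into one instrument-wise
    FX embedding (a single cross-attention step)."""
    weights = softmax([dot(query, t) for t in mix_tokens])
    dim = len(mix_tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, mix_tokens))
            for i in range(dim)]

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: pull the extracted embedding
    toward its matching FX reference, push it away from mismatches."""
    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    logits = [cos(anchor, positive) / tau] + \
             [cos(anchor, n) / tau for n in negatives]
    return -math.log(softmax(logits)[0])
```

As a usage sketch, `extract(guitar_query, mixture_tokens)` would yield the guitar's FX embedding, and `info_nce` would score it against the embedding of the same FX chain (positive) versus other chains (negatives); a multimodal setup would simply allow `guitar_query` to come from either an audio or a text encoder.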