🤖 AI Summary
Existing music recommendation systems struggle to perform controllable, single-attribute editing—such as emotion modification—while preserving other invariant attributes (e.g., genre, instrumentation). To address this, we propose an emotion-guided embedding transformation framework. Our method introduces a learnable emotion mapping module and a proxy-target sampling mechanism, integrated with a lightweight translation model and a joint optimization objective, enabling fine-grained, attribute-disentangled controllable music retrieval. Given audio embeddings as input, the framework leverages mood labels to steer directional embedding transformations, ensuring cross-attribute consistency while improving emotion conversion accuracy and output diversity. Experiments on two public benchmarks demonstrate that our approach significantly outperforms training-free baselines: it achieves substantial gains in emotion conversion accuracy and better preserves the original tracks’ genre and instrumentation characteristics.
📝 Abstract
Music representations are the backbone of modern recommendation systems, powering playlist generation, similarity search, and personalized discovery. Yet most embeddings offer little control for adjusting a single musical attribute, e.g., changing only the mood of a track while preserving its genre or instrumentation. In this work, we address the problem of controllable music retrieval through embedding-based transformation, where the objective is to retrieve songs that remain similar to a seed track but are modified along one chosen dimension. We propose a novel framework for mood-guided music embedding transformation, which learns a mapping from a seed audio embedding to a target embedding guided by mood labels, while preserving other musical attributes. Because mood cannot be directly altered in the seed audio, we introduce a sampling mechanism that retrieves proxy targets to balance diversity with similarity to the seed. We train a lightweight translation model using this sampling strategy and introduce a novel joint objective that encourages transformation and information preservation. Extensive experiments on two datasets show strong mood transformation performance while retaining genre and instrumentation far better than training-free baselines, establishing controllable embedding transformation as a promising paradigm for personalized music retrieval.
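The pipeline the abstract describes—retrieve a proxy target that balances similarity to the seed with diversity, map the seed embedding toward the target mood with a lightweight translation model, and train with a joint transformation-plus-preservation objective—can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, mood encoding, network shape, top-k sampling rule, and the `alpha` weighting are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 64   # audio embedding dimension (assumed)
N_MOODS = 4    # number of mood labels (assumed)

# Hypothetical "lightweight translation model": a one-hidden-layer MLP that
# maps (seed embedding, one-hot mood label) -> transformed embedding.
W1 = rng.normal(0.0, 0.1, (EMB_DIM + N_MOODS, 128))
W2 = rng.normal(0.0, 0.1, (128, EMB_DIM))

def translate(seed_emb, mood_id):
    """Map a seed embedding toward the target mood."""
    mood = np.eye(N_MOODS)[mood_id]
    h = np.maximum(np.concatenate([seed_emb, mood]) @ W1, 0.0)  # ReLU
    return h @ W2

def sample_proxy(seed_emb, candidates, k=5):
    """Retrieve a proxy target from candidates already matching the target
    mood: rank by cosine similarity to the seed, then sample uniformly from
    the top-k, trading similarity (ranking) against diversity (sampling).
    The top-k rule and k=5 are illustrative assumptions."""
    sims = candidates @ seed_emb / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(seed_emb) + 1e-9)
    top_k = np.argsort(-sims)[:k]
    return candidates[rng.choice(top_k)]

def joint_loss(pred, proxy_target, seed_emb, alpha=0.5):
    """Joint objective: pull the output toward the sampled proxy target
    (transformation term) while keeping it close to the seed embedding
    (preservation term). alpha balances the two terms (assumed value)."""
    transform = np.mean((pred - proxy_target) ** 2)
    preserve = np.mean((pred - seed_emb) ** 2)
    return alpha * transform + (1.0 - alpha) * preserve

# Toy usage: one seed track, a pool of mood-matching candidate embeddings.
seed = rng.normal(size=EMB_DIM)
candidates = rng.normal(size=(20, EMB_DIM))
proxy = sample_proxy(seed, candidates)
out = translate(seed, mood_id=2)
loss = joint_loss(out, proxy, seed)
```

In a full training loop the loss would be backpropagated through the MLP weights; the sketch only shows how the sampled proxy target and the seed jointly shape the objective so that mood shifts while other attributes are anchored to the seed.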