🤖 AI Summary
Existing music recommendation systems struggle to perform controllable, single-attribute editing—such as emotion modification—while preserving other invariant attributes (e.g., genre, instrumentation). To address this, we propose an emotion-guided embedding transformation framework. Our method introduces a learnable emotion mapping module and a proxy-target sampling mechanism, integrated with a lightweight translation model and a joint optimization objective, enabling fine-grained, attribute-disentangled controllable music retrieval. Given audio embeddings as input, the framework leverages mood labels to steer directional embedding transformations, ensuring cross-attribute consistency while improving emotion conversion accuracy and output diversity. Experiments on two public benchmarks demonstrate that our approach significantly outperforms training-free baselines: it achieves substantial gains in emotion conversion accuracy and better preserves the original tracks’ genre and instrumentation characteristics.
📝 Abstract
Music representations are the backbone of modern recommendation systems, powering playlist generation, similarity search, and personalized discovery. Yet most embeddings offer little control for adjusting a single musical attribute, e.g., changing only the mood of a track while preserving its genre or instrumentation. In this work, we address the problem of controllable music retrieval through embedding-based transformation, where the objective is to retrieve songs that remain similar to a seed track but are modified along one chosen dimension. We propose a novel framework for mood-guided music embedding transformation, which learns a mapping from a seed audio embedding to a target embedding guided by mood labels, while preserving other musical attributes. Because mood cannot be directly altered in the seed audio, we introduce a sampling mechanism that retrieves proxy targets to balance diversity with similarity to the seed. We train a lightweight translation model using this sampling strategy and introduce a novel joint objective that encourages transformation and information preservation. Extensive experiments on two datasets show strong mood transformation performance while retaining genre and instrumentation far better than training-free baselines, establishing controllable embedding transformation as a promising paradigm for personalized music retrieval.
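The pipeline the abstract describes—retrieve a proxy target that balances similarity to the seed with diversity, map the seed embedding toward the target mood with a lightweight translation model, and train with a joint transformation-plus-preservation objective—can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, mood encoding, network shape, top-k sampling rule, and the `alpha` weighting are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 64   # audio embedding dimension (assumed)
N_MOODS = 4    # number of mood labels (assumed)

# Hypothetical "lightweight translation model": a one-hidden-layer MLP that
# maps (seed embedding, one-hot mood label) -> transformed embedding.
W1 = rng.normal(0.0, 0.1, (EMB_DIM + N_MOODS, 128))
W2 = rng.normal(0.0, 0.1, (128, EMB_DIM))

def translate(seed_emb, mood_id):
    """Map a seed embedding toward the target mood."""
    mood = np.eye(N_MOODS)[mood_id]
    h = np.maximum(np.concatenate([seed_emb, mood]) @ W1, 0.0)  # ReLU
    return h @ W2

def sample_proxy(seed_emb, candidates, k=5):
    """Retrieve a proxy target from candidates already matching the target
    mood: rank by cosine similarity to the seed, then sample uniformly from
    the top-k, trading similarity (ranking) against diversity (sampling).
    The top-k rule and k=5 are illustrative assumptions."""
    sims = candidates @ seed_emb / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(seed_emb) + 1e-9)
    top_k = np.argsort(-sims)[:k]
    return candidates[rng.choice(top_k)]

def joint_loss(pred, proxy_target, seed_emb, alpha=0.5):
    """Joint objective: pull the output toward the sampled proxy target
    (transformation term) while keeping it close to the seed embedding
    (preservation term). alpha balances the two terms (assumed value)."""
    transform = np.mean((pred - proxy_target) ** 2)
    preserve = np.mean((pred - seed_emb) ** 2)
    return alpha * transform + (1.0 - alpha) * preserve

# Toy usage: one seed track, a pool of mood-matching candidate embeddings.
seed = rng.normal(size=EMB_DIM)
candidates = rng.normal(size=(20, EMB_DIM))
proxy = sample_proxy(seed, candidates)
out = translate(seed, mood_id=2)
loss = joint_loss(out, proxy, seed)
```

In a full training loop the loss would be backpropagated through the MLP weights; the sketch only shows how the sampled proxy target and the seed jointly shape the objective so that mood shifts while other attributes are anchored to the seed.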