🤖 AI Summary
Existing music-to-palette generation methods suffer from two key limitations: (1) they output only a single dominant color, failing to capture dynamic emotional shifts in music; or (2) they rely on intermediate text or image representations, losing fine-grained emotional semantics. To address this, we propose an end-to-end cross-modal generation framework. We introduce MuCED, the first professionally annotated music-to-palette dataset, and design a joint architecture comprising a music encoder and a color decoder to directly model the auditory-to-visual emotional mapping. We further incorporate a multi-objective optimization strategy grounded in Russell's circumplex model of affect, jointly optimizing emotional alignment, color diversity, and palette coherence. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches across multiple quantitative metrics. Moreover, it exhibits superior expressiveness and practical utility in downstream applications including music-driven image recoloring, video generation, and visual analytics.
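As a rough illustration of the encoder-decoder design described above, the minimal PyTorch sketch below maps a sequence of audio features to a fixed-size palette. All module choices here (a GRU encoder, an MLP decoder, five RGB colors, the feature dimensions) are assumptions made for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class Music2PaletteSketch(nn.Module):
    """Hypothetical end-to-end mapping from audio features to a K-color palette.
    Module names and shapes are assumptions, not the paper's implementation."""
    def __init__(self, audio_dim=128, hidden_dim=256, num_colors=5):
        super().__init__()
        # Music encoder: summarizes a sequence of audio frames into one embedding.
        self.music_encoder = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        # Color decoder: maps the music embedding to K RGB colors in [0, 1].
        self.color_decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_colors * 3), nn.Sigmoid(),
        )
        self.num_colors = num_colors

    def forward(self, audio_frames):                # (B, T, audio_dim)
        _, h = self.music_encoder(audio_frames)     # h: (1, B, hidden_dim)
        palette = self.color_decoder(h.squeeze(0))  # (B, K * 3)
        return palette.view(-1, self.num_colors, 3)

# Example: a batch of 2 clips, 100 frames of 128-dim features each.
model = Music2PaletteSketch()
print(model(torch.randn(2, 100, 128)).shape)        # torch.Size([2, 5, 3])
```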
📝 Abstract
Emotion alignment between music and color palettes is crucial for effective multimedia content; misalignment creates confusion that weakens the intended message. Existing methods, however, often generate only a single dominant color, missing emotional variation, or rely on indirect mappings through text or images, losing crucial emotional detail. To address these challenges, we present Music2Palette, a novel method for emotion-aligned color palette generation via cross-modal representation learning. We first construct MuCED, a dataset of 2,634 expert-validated music-palette pairs aligned through Russell-based emotion vectors. To translate music directly into palettes, we propose a cross-modal representation learning framework with a music encoder and a color decoder. We further propose a multi-objective optimization approach that jointly enhances emotion alignment, color diversity, and palette coherence. Extensive experiments demonstrate that our method outperforms current approaches in interpreting music emotion and generating attractive, diverse color palettes. Our approach enables applications such as music-driven image recoloring, video generation, and data visualization, bridging the gap between auditory and visual emotional experiences.
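To make the multi-objective optimization concrete, the following sketch shows one plausible way the three objectives could be combined into a single training loss. The specific term formulations (cosine alignment of Russell valence-arousal vectors, pairwise color distance for diversity, a smoothness penalty for coherence) and the weights are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def palette_loss(pred_palette, pred_emotion, target_emotion,
                 w_align=1.0, w_div=0.1, w_coh=0.1):
    """Illustrative combination of the three stated objectives.
    pred_palette: (B, K, 3) colors in [0, 1]; emotions: (B, D) Russell vectors.
    Exact loss terms and weights in the paper are assumed, not reproduced."""
    # Emotion alignment: match predicted and target Russell emotion vectors.
    align = 1.0 - F.cosine_similarity(pred_emotion, target_emotion, dim=-1).mean()

    # Color diversity: reward spread among colors within a palette
    # (negative mean pairwise distance, so minimizing increases diversity).
    diversity = -torch.cdist(pred_palette, pred_palette).mean()

    # Palette coherence: penalize abrupt jumps between neighboring colors.
    coherence = (pred_palette[:, 1:] - pred_palette[:, :-1]).pow(2).mean()

    return w_align * align + w_div * diversity + w_coh * coherence
```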