Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of generating perceptually natural and emotionally coherent music from artistic images and user comments without explicit emotion annotations. To this end, we propose ArtiCaps, a pseudo-emotion-aligned multimodal dataset, and Art2Music, a lightweight end-to-end framework. Art2Music employs OpenCLIP for joint image-text encoding and introduces a gated residual fusion module to align cross-modal representations. Mel-spectrograms are decoded by a bidirectional LSTM optimized with a frequency-weighted L1 loss for spectral fidelity, and a fine-tuned HiFi-GAN vocoder then synthesizes high-fidelity waveforms. Evaluated on ArtiCaps, the model achieves significant reductions in Mel-Cepstral Distortion (MCD) and Fréchet Audio Distance (FAD), and a small-scale LLM-based assessment confirms strong cross-modal emotional consistency. With only 50K training samples, Art2Music delivers high-fidelity, semantically coherent music generation, demonstrating both computational efficiency and interpretability.

📝 Abstract
With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Fréchet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50K training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
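The abstract mentions a frequency-weighted L1 loss that emphasizes high-frequency fidelity in the predicted Mel-spectrograms, but does not specify the weighting scheme. A minimal sketch, assuming a simple linear ramp over mel bins (the `alpha` parameter and the ramp itself are illustrative assumptions, not the paper's definition):

```python
import numpy as np

def freq_weighted_l1(pred_mel, target_mel, alpha=1.0):
    """L1 loss with weights increasing linearly toward high mel bins.

    pred_mel, target_mel: arrays of shape (n_mels, n_frames).
    alpha: assumed emphasis factor; alpha=0 recovers plain L1.
    NOTE: the linear ramp is a hypothetical choice for illustration.
    """
    n_mels = pred_mel.shape[0]
    # weight 1.0 at the lowest bin, (1 + alpha) at the highest bin
    w = 1.0 + alpha * np.arange(n_mels) / (n_mels - 1)
    return np.mean(w[:, None] * np.abs(pred_mel - target_mel))

# toy example: error concentrated in the top mel bin is penalized more
pred = np.zeros((4, 3))
target = np.zeros((4, 3))
target[-1, :] = 1.0  # mismatch only in the highest-frequency bin
print(freq_weighted_l1(pred, target, alpha=1.0))  # 0.5 (vs 0.25 unweighted)
```

With `alpha=0` this reduces to ordinary mean absolute error, so the weighting can be ablated without changing the rest of the training loop.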
Problem

Research questions and friction points this paper is trying to address.

Generates feeling-aligned music from artistic images and comments
Addresses the need for flexible methods without costly emotion labels
Enhances perceptual naturalness and semantic consistency in audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight cross-modal framework synthesizes music from images and text
Gated residual module fuses OpenCLIP encoded multimodal representations
Fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms from spectrograms
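The gated residual fusion named above can be sketched in a few lines. This is a generic interpretation, not the paper's exact module: a sigmoid gate computed from the concatenated embeddings decides, per dimension, how much text signal to mix into the image representation, with a residual path preserving the image embedding. All dimensions and weights here are toy placeholders (real inputs would be OpenCLIP embeddings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_fusion(img_emb, txt_emb, W_g, b_g):
    """Fuse image and text embeddings with a learned gate.

    g = sigmoid([img; txt] @ W_g + b_g) is a per-dimension gate in (0, 1);
    the residual connection keeps the image embedding intact when g -> 0.
    """
    joint = np.concatenate([img_emb, txt_emb], axis=-1)
    g = sigmoid(joint @ W_g + b_g)
    return img_emb + g * txt_emb

# toy dimensions (hypothetical; OpenCLIP embeddings are much larger)
d = 4
rng = np.random.default_rng(0)
img = rng.standard_normal(d)
txt = rng.standard_normal(d)
W_g = rng.standard_normal((2 * d, d)) * 0.1
b_g = np.zeros(d)

fused = gated_residual_fusion(img, txt, W_g, b_g)
print(fused.shape)  # (4,)
```

The residual form makes the fusion easy to regularize: if the gate saturates near zero, the module degrades gracefully to the image-only representation rather than corrupting it.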
Jiaying Hong
School of Computing, Newcastle University, Newcastle upon Tyne, UK
Ting Zhu
School of Computing, Newcastle University, Newcastle upon Tyne, UK
Thanet Markchom
Department of Computer Science, University of Reading
Recommender System · Machine Learning · Computer Vision · Natural Language Processing
Huizhi Liang
Newcastle University
Data Mining · Machine Learning · Personalization · Recommender Systems