🤖 AI Summary
To address the limitation of unimodal music similarity retrieval in modeling complex semantic relationships on streaming platforms, this paper proposes an LLM-augmented cross-modal contrastive learning framework. Its key contributions are: (1) a novel dual-source text data construction mechanism integrating web crawling and LLM-driven prompt generation; (2) a context-aware music description mining paradigm that alleviates the scarcity of high-quality text–music pairs; and (3) a noise-robust text–audio joint embedding architecture with online augmentation during training. Extensive evaluations, including objective metrics, human subjective assessments, and large-scale A/B testing on Huawei Music, demonstrate consistent superiority over state-of-the-art baselines: music similarity retrieval accuracy improves by 12.7%, and user click-through rate increases by 8.3%.
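As a rough illustration of the joint-embedding objective such a framework typically optimizes, the sketch below combines a symmetric InfoNCE loss over matched text–audio pairs with a simple online noise augmentation. The Gaussian-noise choice, SNR level, and temperature are assumptions made for illustration, not the paper's reported configuration.

```python
# Minimal sketch of a text–audio contrastive objective with online
# augmentation. Augmentation type, SNR, and temperature are illustrative
# assumptions, not the paper's exact settings.
import torch
import torch.nn.functional as F

def augment_audio(wave: torch.Tensor, snr_db: float = 20.0) -> torch.Tensor:
    """Add Gaussian noise at a target SNR (one possible online augmentation)."""
    signal_power = wave.pow(2).mean(dim=-1, keepdim=True)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + noise_power.sqrt() * torch.randn_like(wave)

def contrastive_loss(text_emb: torch.Tensor,
                     audio_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched text–audio pairs sit on the diagonal."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull each matched pair together while pushing apart in-batch mismatches.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Applying `augment_audio` to each waveform before the audio encoder at every training step is one plausible way noise robustness could be trained in, since the model must then map clean text to perturbed audio in the shared embedding space.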
📝 Abstract
Music similarity retrieval is fundamental to managing and exploring relevant content in the large collections of streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional unimodal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text–music paired data, the paper introduces a dual-source data acquisition approach combining online scraping with LLM-based prompting, where carefully designed prompts draw on LLMs' broad music knowledge to generate contextually rich descriptions. Extensive experiments, spanning objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform, demonstrate significant performance improvements over existing baselines.
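For the LLM-driven half of the data construction, the prompting could look roughly like the hypothetical sketch below; the `describe_track` helper, the prompt wording, the metadata fields, and the model choice are all illustrative assumptions, since the abstract does not disclose the paper's exact prompts or model.

```python
# Hypothetical sketch of LLM-based description generation for the
# dual-source text data pipeline. Prompt wording, metadata fields, and the
# model name are assumptions; the paper's actual prompts are not published.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_track(title: str, artist: str, genre: str) -> str:
    prompt = (
        f"Describe the song '{title}' by {artist} (genre: {genre}) in two or "
        "three sentences, covering mood, instrumentation, tempo, and typical "
        "listening context. Focus on how the music sounds, not the lyrics."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable instruction-tuned LLM works
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```

Generated descriptions would presumably then be filtered and merged with the scraped corpus before being paired with audio for contrastive training.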