FusID: Modality-Fused Semantic IDs for Generative Music Recommendation

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing generative music recommendation methods, which struggle to model cross-modal interactions when processing multimodal information independently, leading to redundant representations and suboptimal recommendation performance. To overcome this, we propose FusID, a novel framework that introduces a unified semantic ID through cross-modal joint encoding. FusID integrates contrastive representation learning with product quantization to generate discrete token sequences that are both conflict-free and highly discriminative. This approach effectively eliminates ID collisions, enhances embedding utilization, and explicitly captures multimodal synergies. Evaluated on the playlist continuation task, FusID achieves zero ID conflicts and significantly outperforms current baselines across key metrics, including MRR and Recall@k (k=1,5,10,20).
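The summary's claim of "zero ID conflicts" can be made concrete with a toy sketch of product quantization: each fused embedding is split into sub-vectors, and each sub-vector is mapped to its nearest codebook centroid, so every song receives a short discrete token sequence. This is an illustrative stand-in only; the function name, codebook sizes, and the random (rather than learned) codebooks are assumptions, whereas FusID learns its codebooks jointly with the fused representation.

```python
import numpy as np

def product_quantize(embeddings, n_subspaces=4, n_codes=256, seed=0):
    """Map each fused embedding to a sequence of discrete tokens.

    Hypothetical sketch: each embedding is split into `n_subspaces`
    sub-vectors, and each sub-vector is assigned the index of its
    nearest centroid in that subspace's codebook. Codebooks here are
    random Gaussians purely for illustration.
    """
    rng = np.random.default_rng(seed)
    n_items, dim = embeddings.shape
    sub_dim = dim // n_subspaces
    ids = np.empty((n_items, n_subspaces), dtype=np.int64)
    for s in range(n_subspaces):
        sub = embeddings[:, s * sub_dim:(s + 1) * sub_dim]
        codebook = rng.standard_normal((n_codes, sub_dim))
        # nearest-centroid assignment by squared Euclidean distance
        dists = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        ids[:, s] = dists.argmin(axis=1)
    return ids

# toy usage: 100 songs with 64-dim fused embeddings -> 4-token semantic IDs
emb = np.random.default_rng(1).standard_normal((100, 64))
tokens = product_quantize(emb)
print(tokens.shape)  # (100, 4): one 4-token semantic ID per song
```

"Zero ID conflicts" then means the mapping from token sequences back to songs is injective: no two songs share the same 4-token ID, so a generative model that emits a token sequence identifies exactly one song.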

📝 Abstract
Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions that limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflict. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, ensuring that each token sequence maps to exactly one song, mitigates codebook underutilization, and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).
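Component (ii) of the abstract, which pulls frequently co-occurring item embeddings together while keeping items distinct, can be sketched as an InfoNCE-style contrastive objective over playlist co-occurrence. The function name, the single-positive setup, and the cosine-similarity/temperature choices below are assumptions for illustration; the paper's exact loss may differ.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss sketch: pull `anchor` toward a co-occurring
    item (`positive`) and away from non-co-occurring items
    (`negatives`). Lower loss means the positive dominates the
    similarity distribution.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # cross-entropy on the positive

# toy usage: a song, a near-duplicate from the same playlist, and 8 others
rng = np.random.default_rng(0)
anchor = rng.standard_normal(32)
co_occurring = anchor + 0.1 * rng.standard_normal(32)  # same-playlist song
randoms = [rng.standard_normal(32) for _ in range(8)]  # other-playlist songs
loss = info_nce(anchor, co_occurring, randoms)
```

Minimizing this loss over many playlist-derived pairs moves co-occurring songs closer in the fused embedding space while pushing unrelated songs apart, which is what keeps the downstream quantized tokens discriminative rather than redundant.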
Problem

Research questions and friction points this paper is trying to address.

modality fusion
semantic IDs
generative recommendation
multimodal representation
ID conflict
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality fusion
semantic ID
generative recommendation
product quantization
multimodal representation