🤖 AI Summary
This work addresses a limitation of existing generative music recommendation methods: because they tokenize each modality independently, they fail to model cross-modal interactions and produce redundant representations, which hurts recommendation quality. To overcome this, we propose FusID, a framework that builds a unified semantic ID by jointly encoding information across modalities. FusID combines contrastive representation learning with product quantization to turn the fused embeddings into discrete token sequences that are unique per item and discriminative. Evaluated on the playlist continuation task, FusID achieves zero ID conflicts, improves codebook utilization, and outperforms the baselines on MRR and Recall@k (k = 1, 5, 10, 20).
📝 Abstract
Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions, which limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflicts. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, so that each token sequence maps to exactly one song. It also mitigates codebook underutilization and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).
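To make component (iii) concrete, the following is a minimal sketch of product quantization and of how ID conflicts and codebook utilization can be measured. It is not the authors' implementation: the plain k-means codebooks, the random stand-in embeddings, the sizes (500 items, 32 dimensions, 4 subspaces, 16 codes each), and all function names are assumptions chosen for illustration.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed codebook learner); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d.argmin(axis=1)


def product_quantize(emb, num_subspaces, codebook_size):
    """Split each embedding into sub-vectors and quantize each subspace separately."""
    n, dim = emb.shape
    assert dim % num_subspaces == 0, "dimension must divide evenly into subspaces"
    chunks = emb.reshape(n, num_subspaces, dim // num_subspaces)
    tokens = np.stack(
        [kmeans(chunks[:, m, :], codebook_size, seed=m)[1] for m in range(num_subspaces)],
        axis=1,
    )
    return tokens  # shape (n, num_subspaces): one discrete token per subspace


def count_conflicts(tokens):
    """Number of items whose full token sequence duplicates another item's sequence."""
    return len(tokens) - len(np.unique(tokens, axis=0))


def codebook_utilization(tokens, codebook_size):
    """Fraction of codes actually used in each subspace."""
    return np.array(
        [len(np.unique(tokens[:, m])) / codebook_size for m in range(tokens.shape[1])]
    )


# Hypothetical stand-in for fused song embeddings.
fused = np.random.default_rng(0).normal(size=(500, 32))
tokens = product_quantize(fused, num_subspaces=4, codebook_size=16)
print(tokens.shape, count_conflicts(tokens), codebook_utilization(tokens, 16))
```

This sketch does not by itself guarantee zero conflicts. Whether two items receive the same token sequence depends on the number of subspaces, the codebook size, and how distinct the fused embeddings are, which is where components (i) and (ii) come in.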