CM$^3$: Calibrating Multimodal Recommendation

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal recommendation models suffer from an imbalance between embedding space alignment and uniformity—overemphasizing uniform distribution on the hypersphere manifold undermines the geometric proximity of semantically similar items. Method: We propose a spherical representation learning framework with multimodal similarity calibration, featuring: (1) a calibrated uniformity loss that dynamically adjusts item density on the hypersphere according to multimodal similarity; (2) a Spherical Bézier fusion mechanism enabling geometrically consistent integration of multimodal features within the manifold space; and (3) incorporation of fine-grained semantic features extracted by multimodal large language models (MLLMs). Results: Evaluated on five real-world datasets, our method achieves up to a 5.4% improvement in NDCG@20 over state-of-the-art baselines, demonstrating the effectiveness of enhanced alignment and manifold-aware multimodal fusion.

📝 Abstract
Alignment and uniformity are fundamental principles within the domain of contrastive learning. In recommender systems, prior work has established that optimizing the Bayesian Personalized Ranking (BPR) loss contributes to the objectives of alignment and uniformity. Specifically, alignment aims to draw together the representations of interacting users and items, while uniformity mandates a uniform distribution of user and item embeddings across a unit hypersphere. This study revisits the alignment and uniformity properties within the context of multimodal recommender systems, revealing a proclivity among extant models to prioritize uniformity to the detriment of alignment. Our hypothesis challenges the conventional assumption of equitable item treatment through a uniformity loss, proposing a more nuanced approach wherein items with similar multimodal attributes converge toward proximal representations within the hyperspheric manifold. Specifically, we leverage the inherent similarity between items' multimodal data to calibrate their uniformity distribution, thereby inducing a more pronounced repulsive force between dissimilar entities within the embedding space. A theoretical analysis elucidates the relationship between this calibrated uniformity loss and the conventional uniformity function. Moreover, to enhance the fusion of multimodal features, we introduce a Spherical Bézier method designed to integrate an arbitrary number of modalities while ensuring that the resulting fused features are constrained to the same hyperspherical manifold. Empirical evaluations conducted on five real-world datasets substantiate the superiority of our approach over competing baselines. We also show that the proposed methods can achieve up to a 5.4% increase in NDCG@20 performance via the integration of MLLM-extracted features. Source code is available at: https://github.com/enoche/CM3.
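The abstract describes calibrating the uniformity loss with multimodal similarity so that dissimilar items repel each other more strongly on the unit hypersphere. The paper's exact formulation is in its released code; as a minimal illustrative sketch, one can re-weight the pairwise Gaussian potential of the standard uniformity loss by multimodal dissimilarity. The function name `calibrated_uniformity` and the `1 - sim` weighting scheme below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def calibrated_uniformity(emb, mm_sim, t=2.0):
    """Uniformity loss on the unit hypersphere, re-weighted so that pairs
    with LOW multimodal similarity contribute a LARGER repulsive term.

    emb    : (N, d) raw item embeddings (L2-normalized inside)
    mm_sim : (N, N) multimodal similarity matrix with entries in [0, 1]
    t      : temperature of the Gaussian potential
    """
    # project embeddings onto the unit hypersphere
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # pairwise squared Euclidean distances ||u - v||^2
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    # dissimilar pairs (sim -> 0) receive the full repulsive weight
    w = 1.0 - mm_sim
    n = z.shape[0]
    mask = ~np.eye(n, dtype=bool)  # exclude self-pairs
    # weighted log-mean-exp of the Gaussian potential
    return np.log(np.mean(w[mask] * np.exp(-t * sq[mask])))
```

With `mm_sim` identically zero (all items maximally dissimilar), the weights are uniform and the expression reduces to the conventional uniformity loss of Wang and Isola, consistent with the theoretical relationship the abstract mentions.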
Problem

Research questions and friction points this paper is trying to address.

Addresses imbalance between alignment and uniformity in multimodal recommender systems
Proposes calibrated uniformity loss using multimodal similarity for better embeddings
Introduces Spherical Bézier method to fuse multimodal features effectively
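The Spherical Bézier fusion described above keeps the fused feature on the same hypersphere as its inputs. A standard way to realize a Bézier curve on a sphere is the de Casteljau recursion with spherical linear interpolation (slerp) in place of linear interpolation; the sketch below follows that construction and is an assumption about the mechanism, not the paper's code:

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v."""
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))  # angle between u, v
    if omega < 1e-8:          # nearly identical directions: interpolation is trivial
        return u
    return (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

def spherical_bezier(points, t):
    """De Casteljau recursion with slerp: fuses an arbitrary number of
    modality features into a single point on the same unit hypersphere."""
    pts = [p / np.linalg.norm(p) for p in points]  # project onto the sphere
    while len(pts) > 1:
        pts = [slerp(pts[i], pts[i + 1], t) for i in range(len(pts) - 1)]
    return pts[0]
```

Because slerp of two unit vectors is again a unit vector, the recursion handles any number of modalities (visual, textual, MLLM-extracted, ...) while guaranteeing the output stays on the manifold, which is the geometric-consistency property the bullet refers to.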
Innovation

Methods, ideas, or system contributions that make the work stand out.

Calibrates uniformity using multimodal similarity
Introduces Spherical Bézier for feature fusion
Strengthens repulsion between dissimilar items while preserving alignment of similar ones