🤖 AI Summary
This work addresses two key limitations of existing multimodal recommendation methods: they ignore individual differences in how users perceive the relevance of item content, and they fail to capture high-order dependencies among modalities. To overcome these challenges, the authors propose the GTC framework, which first employs a user-conditional generative diffusion model to filter multimodal content features in a personalized way. It then explicitly models the joint multimodal dependencies under user perception by optimizing a lower bound on the total correlation of cross-modal representations, moving beyond conventional pairwise contrastive learning. Extensive experiments show that GTC significantly outperforms state-of-the-art baselines on standard benchmarks, achieving up to a 28.30% improvement in NDCG@5. Ablation studies further confirm the effectiveness of each proposed component.
📝 Abstract
Multi-modal recommendation (MMR) enriches item representations with item content, e.g., visual and textual descriptions, to improve upon interaction-only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet the dominant practice of disentangling modality-invariant, preference-driving signals from modality-specific, preference-irrelevant noise is flawed in two ways. First, it assumes item content has a one-size-fits-all relevance to user preferences across all users, contradicting the inherently user-conditional nature of preferences. Second, it optimizes pairwise contrastive losses separately for cross-modal alignment, systematically ignoring the higher-order dependencies that arise when multiple content modalities jointly influence user choices. In this paper, we introduce GTC, a conditional Generative Total Correlation learning framework. We employ an interaction-guided diffusion model to perform user-aware content feature filtering, preserving only the features relevant to each individual user. Furthermore, to capture complete cross-modal dependencies, we optimize a tractable lower bound on the total correlation of item representations across all modalities. Experiments on standard MMR benchmarks show that GTC consistently outperforms state-of-the-art baselines, with gains of up to 28.30% in NDCG@5. Ablation studies validate both conditional preference-driven feature filtering and total correlation optimization, confirming the ability of GTC to model user-conditional relationships in MMR tasks. The code is available at: https://github.com/jingdu-cs/GTC.
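For context on the objective named in the abstract: total correlation is the standard multivariate generalization of mutual information. The sketch below gives the textbook definition of the quantity whose lower bound GTC optimizes (the specific tractable bound is defined in the paper, not here); `Z_1, ..., Z_M` denote item representations from the M content modalities:

```latex
% Total correlation of modality representations Z_1, ..., Z_M:
% the KL divergence between the joint distribution and the product of marginals,
% equivalently the gap between the sum of marginal entropies and the joint entropy.
\mathrm{TC}(Z_1, \dots, Z_M)
  = D_{\mathrm{KL}}\!\left( p(z_1, \dots, z_M) \,\middle\|\, \prod_{m=1}^{M} p(z_m) \right)
  = \sum_{m=1}^{M} H(Z_m) \;-\; H(Z_1, \dots, Z_M).
```

For M = 2 this reduces to ordinary mutual information, which is why a set of separate pairwise contrastive losses captures only a fragment of the joint dependency once three or more modalities influence user choices together.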