🤖 AI Summary
This work proposes FITMM, a framework that addresses the limitations of existing spatial-domain approaches to multimodal recommendation, which often neglect frequency-domain structure and suffer from modality misalignment and redundancy. FITMM brings the information bottleneck (IB) principle into the frequency domain for multimodal recommendation. It constructs graph-enhanced item representations and decomposes each modality into orthogonal frequency bands, from which it forms lightweight within-band multimodal components. A residual, task-adaptive gate fuses the bands into the final representation. A frequency-domain IB regularizer allocates capacity across bands and shuts off weak ones to suppress redundancy, while a cross-modal spectral consistency loss aligns modalities within each band. Experiments on three real-world datasets show that FITMM consistently and significantly outperforms advanced baselines, supporting the effectiveness of frequency-domain modeling for multimodal recommendation.
📝 Abstract
Multimodal recommendation aims to enhance user preference modeling by leveraging rich item content such as images and text. Yet dominant systems fuse modalities in the spatial domain, obscuring the frequency structure of signals and amplifying misalignment and redundancy. We adopt a spectral information-theoretic view and show that, under an orthogonal transform that approximately block-diagonalizes bandwise covariances, the Gaussian Information Bottleneck (IB) objective decouples across frequency bands, providing a principled basis for a separate-then-fuse paradigm. Building on this foundation, we propose FITMM, a Frequency-aware Information-Theoretic framework for multimodal recommendation. FITMM constructs graph-enhanced item representations, performs modality-wise spectral decomposition to obtain orthogonal bands, and forms lightweight within-band multimodal components. A residual, task-adaptive gate aggregates the bands into the final representation. To control redundancy and improve generalization, we regularize training with a frequency-domain IB term that allocates capacity across bands (Wiener-like shrinkage with shut-off of weak bands). We further introduce a cross-modal spectral consistency loss that aligns modalities within each band. The model is jointly optimized with the standard recommendation loss. Extensive experiments on three real-world datasets demonstrate that FITMM consistently and significantly outperforms advanced baselines.
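The separate-then-fuse idea can be illustrated with a minimal NumPy sketch. It is not the FITMM implementation: the abstract does not specify the transform, the band layout, or the form of the gate, so the real FFT over the embedding dimension, the equal-width bands, the fixed `noise_power`, and the function names `split_bands`, `wiener_gains`, and `fuse` are all assumptions made for illustration. In FITMM the gate and the capacity allocation are trained, whereas here a fixed Wiener-style rule stands in for them.

```python
import numpy as np


def split_bands(x, num_bands):
    """Split the last axis of x into frequency bands that sum back to x.

    Assumption: a real FFT over the embedding dimension stands in for the
    orthogonal transform; the bands use disjoint frequency bins, so they
    are mutually orthogonal.
    """
    d = x.shape[-1]
    spec = np.fft.rfft(x, axis=-1)
    # Equal-width band edges over the available frequency bins (hypothetical layout).
    edges = np.linspace(0, spec.shape[-1], num_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(spec)
        masked[..., lo:hi] = spec[..., lo:hi]
        bands.append(np.fft.irfft(masked, n=d, axis=-1))
    return np.stack(bands)  # shape: (num_bands, ..., d)


def wiener_gains(bands, noise_power):
    """Per-band gain 1 - noise/power, clipped at 0 so weak bands are shut off."""
    power = np.mean(bands ** 2, axis=tuple(range(1, bands.ndim)))
    return np.maximum(0.0, 1.0 - noise_power / np.maximum(power, 1e-12))


def fuse(x, bands, gains):
    """Residual fusion: the input plus the gain-weighted sum of its bands."""
    return x + np.tensordot(gains, bands, axes=([0], [0]))


# Toy usage: 4 items with 16-dimensional embeddings for one modality.
rng = np.random.default_rng(0)
item_emb = rng.normal(size=(4, 16))
item_bands = split_bands(item_emb, num_bands=3)
band_gains = wiener_gains(item_bands, noise_power=0.1)
fused = fuse(item_emb, item_bands, band_gains)
```

Because each band occupies its own frequency bins, the bands reconstruct the input exactly when summed, and a band whose average power falls below `noise_power` receives a gain of zero, which mirrors the "shut-off of weak bands" behaviour described above.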