🤖 AI Summary
Multimodal recommendation faces three key challenges: modality-specific noise, semantic inconsistency across modalities, and instability in graph-based message propagation. Existing spectral-domain methods lack both structured spectral reasoning and modality-adaptive reliability modeling. To address these, we propose a frequency-aware Structured Spectral Reasoning (SSR) framework comprising: (1) graph-guided spectral decomposition for modality-specific band-wise representation learning; (2) band-wise reliability modulation and masking to suppress noisy frequency components; (3) low-rank cross-band attention coupled with contrastive regularization to strengthen semantic alignment; and (4) spectral-domain prediction-consistency optimization for improved robustness. To our knowledge, this is the first work to introduce structured spectral reasoning into multimodal recommendation. Extensive experiments on three real-world datasets demonstrate significant improvements over state-of-the-art methods, particularly under sparse and cold-start settings, along with improved generalization, robustness, and interpretability.
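Stages (1) and (2) of the pipeline can be sketched concretely: decompose graph signals into frequency bands via the normalized Laplacian, then modulate each band by a reliability weight with stochastic training-time masking. The sketch below is a minimal NumPy illustration under stated assumptions; the function names, the equal-width split of the eigenvalue spectrum into bands, and the externally supplied `reliability` weights are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def spectral_bands(adj, feats, n_bands=3):
    """Decompose node features into frequency bands using the
    normalized graph Laplacian. Bands are equal-width splits over
    the eigenvalue ordering (an assumption for illustration)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)  # ascending: low -> high frequency
    bands = []
    for idx in np.array_split(np.arange(len(eigvals)), n_bands):
        U = eigvecs[:, idx]                 # orthonormal basis of this band
        bands.append(U @ (U.T @ feats))     # project features onto the band
    return bands                            # the bands sum back to feats

def mask_bands(bands, reliability, p_mask=0.3, rng=None):
    """Band-wise reliability modulation with stochastic whole-band
    masking at training time (reliability weights assumed given)."""
    rng = rng or np.random.default_rng(0)
    out = []
    for band, r in zip(bands, reliability):
        keep = rng.random() > p_mask        # drop the whole band w.p. p_mask
        out.append(band * r * float(keep))
    return sum(out)
```

Because the band bases are orthonormal slices of a full eigenbasis, summing all bands with unit reliability and no masking reconstructs the input exactly, which makes the decomposition easy to sanity-check.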
📝 Abstract
Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens for separating stable from noisy signals, most methods rely on static filtering or reweighting, lacking the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate different levels of semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking scheme with a prediction-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues through spectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance.
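Stage (iii), low-rank cross-band interaction, can be illustrated as cross-attention in which queries and keys are first projected into a rank-r subspace (r much smaller than the feature dimension), keeping the interaction cheap while still letting one band attend to another. The function name and the weight matrices in this NumPy sketch are hypothetical stand-ins for learned parameters, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_cross_band_attention(q_band, kv_band, U_q, U_k, V):
    """Cross-attention between two frequency bands with low-rank
    query/key projections (U_q, U_k: d x r with r << d). The weight
    matrices are assumed to be learned; here they are placeholders."""
    Q = q_band @ U_q                              # (n, r) low-rank queries
    K = kv_band @ U_k                             # (n, r) low-rank keys
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[1])) # (n, n) attention weights
    return q_band + attn @ (kv_band @ V)          # residual band fusion
```

The residual connection preserves the querying band's own signal, so the cross-band term acts as a correction rather than a replacement, which matches the intent of fusing complementary rather than redundant frequency cues.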