🤖 AI Summary
Low cross-modal correlation and heavy noise in pre-trained features degrade clustering performance on multimodal attributed graphs (MMAGs). To address this, the paper proposes Dual Graph Filtering (DGF), a framework that integrates graph signal processing with dual-graph collaborative filtering. DGF adds a feature-level denoising module to suppress modality-specific noise and trains with a tri-level contrastive mechanism that applies instance-level contrastive learning across modalities, neighborhoods, and communities. Its key innovation is embedding both denoising and multi-granularity contrastive learning directly into the graph filtering process, with the aim of making node representations more robust and discriminative. Experiments on eight benchmark datasets show consistent gains over state-of-the-art multi-view and graph clustering methods in ACC, NMI, and other metrics.
📝 Abstract
Multimodal Attributed Graphs (MMAGs) are an expressive data model for representing the complex interconnections among entities associated with attributes from multiple data modalities (e.g., text and images). Clustering over such data has many practical applications, including social community detection and medical data analytics. However, as revealed by our empirical studies, existing multi-view clustering solutions largely rely on high correlation between attributes across views. They overlook the unique characteristics of the multimodal attributes that large pre-trained language and vision models produce in MMAGs (e.g., low modality-wise correlation and intense feature-wise noise), which leads to suboptimal clustering performance.
Inspired by the foregoing empirical observations and our theoretical analyses based on graph signal processing, we propose the Dual Graph Filtering (DGF) scheme, which incorporates a feature-wise denoising component into node representation learning and thereby overcomes the limitations of the traditional graph filters adopted in existing multi-view graph clustering approaches. On top of that, DGF includes a tri-cross contrastive training strategy that employs instance-level contrastive learning across modalities, neighborhoods, and communities to learn robust and discriminative node representations. Our comprehensive experiments on eight benchmark MMAG datasets show that DGF consistently and significantly outperforms a wide range of state-of-the-art baselines in terms of clustering quality measured against ground-truth labels.
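To make the two ingredients named above more concrete, the following is a minimal sketch and not the DGF method itself: the abstract does not specify DGF's filter or its feature-wise denoising component, so the code shows only a conventional low-pass graph filter (the kind of traditional filter the abstract contrasts against) and a standard InfoNCE-style instance-level contrastive loss between two modalities. The function names, the toy graph, the random "text" and "image" features, and the values of `hops` and `temperature` are all assumptions made for illustration.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}: symmetric normalization with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def low_pass_filter(adj, feats, hops=2):
    """Smooth node features by repeated normalized neighborhood averaging.

    This is a generic low-pass filter, not the dual filter from the paper.
    """
    a_norm = normalized_adjacency(adj)
    out = feats
    for _ in range(hops):
        out = a_norm @ out
    return out


def info_nce(z_a, z_b, temperature=0.5):
    """Instance-level contrastive loss: row i of z_a is the positive of row i of z_b."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical toy graph: two triangles joined by a single edge.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

# Random stand-ins for pre-trained text and image embeddings of the same nodes.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(6, 8))
image_feats = rng.normal(size=(6, 8))

z_text = low_pass_filter(adj, text_feats)
z_image = low_pass_filter(adj, image_feats)
loss = info_nce(z_text, z_image)  # cross-modal term only
```

In this sketch only the cross-modal term is shown. Neighborhood-level and community-level terms would presumably reuse the same loss with different choices of positive pairs, but how DGF constructs those pairs is not stated in the abstract.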