🤖 AI Summary
To address the degradation of clustering performance caused by the coexistence of homophilous and heterophilous neighborhoods in multimodal graphs, this paper proposes a decoupled multimodal graph clustering framework, DMGC. DMGC is the first method to explicitly disentangle homophilous relationships, which enhance intra-class semantic consistency, from heterophilous relationships, which capture inter-class associations, thereby constructing dual-view graph representations. It introduces a multimodal dual-frequency fusion mechanism, graph structure decomposition, and cross-modal consistency modeling, coupled with a self-supervised alignment objective to mitigate class confusion. Extensive experiments on multiple multimodal and multi-relational graph benchmarks show that DMGC consistently outperforms state-of-the-art methods. These results validate both its effectiveness in handling mixed neighborhood structures and its strong generalization across diverse multimodal graph scenarios.
📝 Abstract
Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. Through empirical analysis, we observe that real-world multimodal graphs often exhibit hybrid neighborhood patterns, combining both homophilic and heterophilic relationships. To address this challenge, we propose a novel framework -- *Disentangled Multimodal Graph Clustering (DMGC)* -- which decomposes the original hybrid graph into two complementary views: (1) a homophily-enhanced graph that captures cross-modal class consistency, and (2) heterophily-aware graphs that preserve modality-specific inter-class distinctions. We introduce a *Multimodal Dual-frequency Fusion* mechanism that jointly filters these disentangled graphs through a dual-pass strategy, enabling effective multimodal integration while mitigating category confusion. Our self-supervised alignment objectives further guide the learning process without requiring labels. Extensive experiments on both multimodal and multi-relational graph datasets demonstrate that DMGC achieves state-of-the-art performance, highlighting its effectiveness and generalizability across diverse settings. Our code is available at https://github.com/Uncnbb/DMGC.
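The dual-pass idea above can be illustrated with standard graph signal filters: a low-pass filter smooths features over the homophily-enhanced view (pulling same-class neighbors together), while a high-pass filter sharpens feature differences over the heterophily-aware view (preserving inter-class contrast). The sketch below is a minimal, hypothetical illustration of this general technique, not the authors' implementation; the function names, the filter coefficient `0.5`, and the use of a single shared adjacency per view are assumptions for the example.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2},
    with self-loops added so every degree is positive."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt

def dual_pass_filter(A_homo, A_hetero, X, k=2):
    """Filter node features X through two decomposed graph views:
    - low-pass (I - 0.5 L) over the homophilous view -> smooth, class-consistent features
    - high-pass (0.5 L) over the heterophilous view -> difference-emphasizing features
    k is the number of propagation steps."""
    L_homo = normalized_laplacian(A_homo)
    L_hetero = normalized_laplacian(A_hetero)
    I = np.eye(X.shape[0])
    H_low, H_high = X.copy(), X.copy()
    for _ in range(k):
        H_low = (I - 0.5 * L_homo) @ H_low    # low-pass: retain smooth (low-frequency) signal
        H_high = (0.5 * L_hetero) @ H_high    # high-pass: retain high-frequency signal
    return H_low, H_high

# Toy usage: a 4-node graph with 3-dimensional node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
H_low, H_high = dual_pass_filter(A, A, X)
```

In a full pipeline the two filtered representations would then be fused (e.g., concatenated or attention-weighted) per modality before clustering; that fusion step is omitted here for brevity.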