Unlocking Multi-Modal Potentials for Dynamic Text-Attributed Graph Representation

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient co-modeling of temporal, textual, and structural modalities and the inadequate exploitation of local graph structure in dynamic text-attributed graphs (DyTAGs), this paper proposes a node-centric multimodal representation learning paradigm. Methodologically, it introduces: (1) non-shared node-centric attention encoders to explicitly model the three heterogeneous modalities; (2) a theoretically grounded time-semantic symmetric alignment loss to ensure cross-modal consistency; and (3) a lightweight cross-modal adapter for efficient multimodal fusion. The approach is architecture-agnostic: it is fully compatible with mainstream dynamic graph models without requiring underlying architectural modifications. Evaluated on seven benchmark datasets across two downstream tasks (node classification and link prediction), it achieves substantial average performance gains of up to 33.62%, demonstrating both the effectiveness and generalizability of multimodal co-modeling for DyTAGs.
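For intuition, the time-semantic symmetric alignment loss can be sketched as a CLIP-style symmetric InfoNCE objective over per-node tokens. The snippet below is a minimal, hypothetical PyTorch sketch; the function name, temperature value, and exact formulation are illustrative assumptions rather than the paper's verified objective.

import torch
import torch.nn.functional as F

def symmetric_alignment_loss(temporal_tokens, textual_tokens, temperature=0.07):
    # Hypothetical sketch: align each node's temporal token with its own
    # textual token (positives on the diagonal) against all other nodes.
    t = F.normalize(temporal_tokens, dim=-1)  # [N, d]
    s = F.normalize(textual_tokens, dim=-1)   # [N, d]
    logits = t @ s.t() / temperature          # cosine-similarity logits
    targets = torch.arange(t.size(0), device=t.device)
    # Symmetric in both directions: temporal->textual and textual->temporal.
    loss_t2s = F.cross_entropy(logits, targets)
    loss_s2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2s + loss_s2t)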

📝 Abstract
Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal edges alongside rich textual attributes. A prior approach to representing DyTAGs leverages pre-trained language models to encode text attributes and subsequently integrates them into dynamic graph models. However, it follows edge-centric modeling, as in dynamic graph learning, which is limited to local structures and fails to exploit the unique characteristics of DyTAGs, leading to suboptimal performance. We observe that DyTAGs inherently comprise three distinct modalities (temporal, textual, and structural) that often exhibit dispersed or even orthogonal distributions, with the first two largely overlooked in existing research. Building on this insight, we propose MoMent, a model-agnostic multi-modal framework that can seamlessly integrate with dynamic graph models for structural modality learning. The core idea is to shift from edge-centric to node-centric modeling, fully leveraging all three modalities for node representation. Specifically, MoMent presents non-shared node-centric encoders based on the attention mechanism to capture global temporal and semantic contexts from the temporal and textual modalities, together with local structure learning, thus generating modality-specific tokens. To prevent a disjoint latent space, we propose a symmetric alignment loss, an auxiliary objective that aligns temporal and textual tokens, ensuring global temporal-semantic consistency with a theoretical guarantee. Finally, we design a lightweight adapter to fuse these tokens, generating comprehensive and cohesive node representations. We theoretically demonstrate that MoMent enhances discriminative power over exclusively edge-centric modeling. Extensive experiments across seven datasets and two downstream tasks show that MoMent achieves up to a 33.62% improvement over the baseline using four dynamic graph models.
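The lightweight adapter can likewise be pictured as a small gated-fusion module over the three modality tokens. The class below is a speculative sketch; the class name, bottleneck size, and gating scheme are assumptions for illustration, not MoMent's published architecture.

import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    # Hypothetical adapter: gate-weighted mixing of the three modality
    # tokens plus a residual bottleneck refinement, kept parameter-light.
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)         # per-modality mixing weights
        self.down = nn.Linear(3 * dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, temporal, textual, structural):
        tokens = torch.stack([temporal, textual, structural], dim=1)  # [N, 3, d]
        flat = tokens.flatten(1)                                      # [N, 3d]
        weights = torch.softmax(self.gate(flat), dim=-1)              # [N, 3]
        mixed = (weights.unsqueeze(-1) * tokens).sum(dim=1)           # [N, d]
        # Residual bottleneck refinement keeps the fused token close to
        # the gated mixture while adding a learnable correction.
        return mixed + self.up(self.act(self.down(flat)))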
Problem

Research questions and friction points this paper is trying to address.

Enhancing representation of dynamic text-attributed graphs
Integrating temporal, textual, and structural modalities
Improving node-centric modeling over edge-centric approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Node-centric multi-modal framework
Attention-based temporal-semantic encoders
Symmetric alignment loss integration
Yuanyuan Xu
University of New South Wales
Graph Neural Networks · Big Data
Wenjie Zhang
University of New South Wales, Sydney, Australia
Ying Zhang
Zhejiang Gongshang University, Hangzhou, China
Xuemin Lin
Shanghai Jiao Tong University, Shanghai, China
Xiwei Xu
CSIRO Data61, Eveleigh, Australia