Unlocking Multi-Modal Potentials for Dynamic Text-Attributed Graph Representation

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient co-modeling of temporal, textual, and structural modalities and the inadequate exploitation of local graph structure in dynamic text-attributed graphs (DyTAGs), this paper proposes a node-centric multimodal representation learning paradigm. Methodologically, it introduces: (1) non-shared node-centric attention encoders to explicitly model the three heterogeneous modalities; (2) a theoretically grounded time-semantic symmetric alignment loss to ensure cross-modal consistency; and (3) a lightweight cross-modal adapter for efficient multimodal fusion. The approach is architecture-agnostic: it is fully compatible with mainstream dynamic graph models without requiring underlying architectural modifications. Evaluated on seven benchmark datasets across two downstream tasks (node classification and link prediction), it achieves substantial average performance gains of up to 33.62%, demonstrating both the effectiveness and generalizability of multimodal co-modeling for DyTAGs.
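For intuition, the time-semantic symmetric alignment loss can be sketched as a CLIP-style symmetric InfoNCE objective over per-node tokens. The snippet below is a minimal, hypothetical PyTorch sketch; the function name, temperature value, and exact formulation are illustrative assumptions rather than the paper's verified objective.

import torch
import torch.nn.functional as F

def symmetric_alignment_loss(temporal_tokens, textual_tokens, temperature=0.07):
    # Hypothetical sketch: align each node's temporal token with its own
    # textual token (positives on the diagonal) against all other nodes.
    t = F.normalize(temporal_tokens, dim=-1)  # [N, d]
    s = F.normalize(textual_tokens, dim=-1)   # [N, d]
    logits = t @ s.t() / temperature          # cosine-similarity logits
    targets = torch.arange(t.size(0), device=t.device)
    # Symmetric in both directions: temporal->textual and textual->temporal.
    loss_t2s = F.cross_entropy(logits, targets)
    loss_s2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2s + loss_s2t)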

📝 Abstract
Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal edges alongside rich textual attributes. A prior approach to representing DyTAGs leverages pre-trained language models to encode text attributes and subsequently integrates them into dynamic graph models. However, it follows edge-centric modeling, as in dynamic graph learning, which is limited to local structures and fails to exploit the unique characteristics of DyTAGs, leading to suboptimal performance. We observe that DyTAGs inherently comprise three distinct modalities (temporal, textual, and structural) that often exhibit dispersed or even orthogonal distributions, with the first two largely overlooked in existing research. Building on this insight, we propose MoMent, a model-agnostic multi-modal framework that can seamlessly integrate with dynamic graph models for structural modality learning. The core idea is to shift from edge-centric to node-centric modeling, fully leveraging all three modalities for node representation. Specifically, MoMent presents non-shared node-centric encoders based on the attention mechanism to capture global temporal and semantic contexts from the temporal and textual modalities, together with local structure learning, thus generating modality-specific tokens. To prevent a disjoint latent space, we propose a symmetric alignment loss, an auxiliary objective that aligns temporal and textual tokens, ensuring global temporal-semantic consistency with a theoretical guarantee. Finally, we design a lightweight adapter to fuse these tokens, generating comprehensive and cohesive node representations. We theoretically demonstrate that MoMent enhances discriminative power over exclusively edge-centric modeling. Extensive experiments across seven datasets and two downstream tasks show that MoMent achieves up to a 33.62% improvement over the baseline using four dynamic graph models.
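The lightweight adapter can likewise be pictured as a small gated-fusion module over the three modality tokens. The class below is a speculative sketch; the class name, bottleneck size, and gating scheme are assumptions for illustration, not MoMent's published architecture.

import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    # Hypothetical adapter: gate-weighted mixing of the three modality
    # tokens plus a residual bottleneck refinement, kept parameter-light.
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)         # per-modality mixing weights
        self.down = nn.Linear(3 * dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, temporal, textual, structural):
        tokens = torch.stack([temporal, textual, structural], dim=1)  # [N, 3, d]
        flat = tokens.flatten(1)                                      # [N, 3d]
        weights = torch.softmax(self.gate(flat), dim=-1)              # [N, 3]
        mixed = (weights.unsqueeze(-1) * tokens).sum(dim=1)           # [N, d]
        # Residual bottleneck refinement keeps the fused token close to
        # the gated mixture while adding a learnable correction.
        return mixed + self.up(self.act(self.down(flat)))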
Problem

Research questions and friction points this paper is trying to address.

Enhancing representation of dynamic text-attributed graphs
Integrating temporal, textual, and structural modalities
Improving node-centric modeling over edge-centric approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Node-centric multi-modal framework
Attention-based temporal-semantic encoders
Symmetric alignment loss integration
Yuanyuan Xu
University of New South Wales
Graph Neural Networks · Big Data
Wenjie Zhang
University of New South Wales, Sydney, Australia
Ying Zhang
Zhejiang Gongshang University, Hangzhou, China
Xuemin Lin
Shanghai Jiao Tong University, Shanghai, China
Xiwei Xu
CSIRO Data61, Eveleigh, Australia