🤖 AI Summary
To address the limited capability of existing methods to jointly represent graph-structured and textual modalities, this paper proposes GT2Vec, a unified multimodal encoding framework built on a large language model (LLM). Methodologically, it (1) uses the LLM directly as a joint graph-text encoder rather than merely a text processor; (2) introduces a graph-text contrastive learning mechanism to explicitly align the heterogeneous embedding spaces; and (3) employs a lightweight MLP adapter for cross-modal projection. Evaluated on six benchmark datasets spanning knowledge graph-contextualized question answering, graph-text pair classification, and retrieval, the framework consistently outperforms state-of-the-art approaches, significantly improving joint embedding quality and downstream task performance and pointing toward a new paradigm for LLM-driven multimodal graph learning.
📝 Abstract
Graph-structured data offers rich context that can enhance language models by providing structured relationships and hierarchies, leading to more expressive embeddings for applications such as retrieval, question answering, and classification. However, existing methods for integrating graph and text embeddings, often based on multi-layer perceptrons (MLPs) or shallow transformers, cannot fully exploit the heterogeneous nature of the two modalities. To overcome this, we propose GT2Vec, a simple yet effective framework that leverages Large Language Models (LLMs) to jointly encode text and graph data. Specifically, GT2Vec employs an MLP adapter to project graph embeddings into the same space as text embeddings, allowing the LLM to process both modalities jointly. Unlike prior work, we also introduce contrastive learning to align the graph and text spaces more effectively, improving the quality of the learned joint embeddings. Empirical results on six datasets spanning three tasks (knowledge graph-contextualized question answering, graph-text pair classification, and retrieval) show that GT2Vec consistently outperforms existing baselines, with significant improvements across multiple datasets. These results highlight GT2Vec's strength in integrating graph and text data, and ablation studies further validate the contribution of each component.
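The abstract names two concrete components: an MLP adapter that projects graph embeddings into the text-embedding space, and a contrastive objective that aligns the two spaces. The following is a minimal NumPy sketch of how such a pair of components could fit together; the dimensions, the two-layer adapter architecture, and the symmetric InfoNCE formulation are illustrative assumptions, not GT2Vec's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not GT2Vec's configuration):
# graph-encoder output dim, LLM text-embedding dim, adapter hidden dim.
D_GRAPH, D_TEXT, HIDDEN, BATCH = 64, 128, 256, 4

# MLP adapter weights: project graph embeddings into the text space.
W1 = rng.normal(0.0, 0.02, (D_GRAPH, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.02, (HIDDEN, D_TEXT)); b2 = np.zeros(D_TEXT)

def adapter(g):
    """Two-layer MLP with ReLU: maps (B, D_GRAPH) -> (B, D_TEXT)."""
    return np.maximum(g @ W1 + b1, 0.0) @ W2 + b2

def cross_entropy(logits, labels):
    """Mean cross-entropy over rows of a logit matrix (numerically stable)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def contrastive_loss(graph_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th graph and i-th text form a positive
    pair; every other in-batch pairing serves as a negative."""
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = g @ t.T / temperature          # (B, B) cosine similarities
    labels = np.arange(len(logits))         # matched pairs on the diagonal
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))

# Toy batch of paired graph / text embeddings.
graph_raw = rng.normal(size=(BATCH, D_GRAPH))
text_emb = rng.normal(size=(BATCH, D_TEXT))

projected = adapter(graph_raw)              # now lives in the text space
loss = contrastive_loss(projected, text_emb)
```

Pulling the positives onto the diagonal of a batch-by-batch similarity matrix is the standard trick for in-batch contrastive training: minimizing the loss pushes each projected graph embedding toward its paired text embedding and away from the other texts in the batch.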