🤖 AI Summary
This study addresses the challenge of effectively integrating textual semantics and citation network structure to uncover the intrinsic organizational principles of large-scale scholarly publications. Leveraging the Web of Science dataset of ~56 million papers, the work presents the first systematic integration of large language model (LLM)-derived text embeddings with citation graph topology to construct a unified semantic–structural knowledge graph. By establishing a coherent representation of this multimodal, heterogeneous data, the approach reveals natural disciplinary clusters and cross-domain linkage patterns, elucidating a self-organizing knowledge landscape shaped jointly by semantic content and citation relationships. This framework establishes a novel paradigm for knowledge discovery in ultra-large-scale academic corpora.
📝 Abstract
Large text data sets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationships to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. Demonstrating these possibilities and their practicality, we investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
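The two feature types described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the bag-of-words counter below merely stands in for an LLM embedding model, and the mini-corpus, citation edges, and the `combined_score` blending function are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' standing in for an LLM encoder.

    Assumption: in practice this would be a dense vector from a
    transformer embedding model, not token counts.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical mini-corpus: paper id -> text, plus directed citation edges.
papers = {
    "p1": "graph neural networks for citation analysis",
    "p2": "language model embeddings of scientific text",
    "p3": "citation networks and graph analysis of science",
}
citations = {("p1", "p3"), ("p2", "p3")}  # (citing, cited)

emb = {pid: embed(t) for pid, t in papers.items()}


def combined_score(a, b, alpha=0.5):
    """Blend semantic similarity (feature type 1) with a binary
    citation-link signal (feature type 2)."""
    link = 1.0 if (a, b) in citations or (b, a) in citations else 0.0
    return alpha * cosine(emb[a], emb[b]) + (1 - alpha) * link
```

Here "p1" and "p3" score highly on both channels (shared vocabulary and a citation edge), while "p1" and "p2" share neither; a real system would replace both toy signals with LLM embeddings and graph-learning algorithms over the full citation network.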