Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings

📅 2026-02-04
🤖 AI Summary
This study addresses how textual semantics and citation network structure can be integrated to uncover the organization of large-scale scholarly literature. Working with the Web of Science dataset of roughly 56 million papers, it combines large language model (LLM) text embeddings with citation graph topology into a single semantic-structural representation. That representation reveals disciplinary clusters and cross-domain links, pointing to a self-organizing knowledge landscape shaped jointly by semantic content and citation relationships. The authors present the framework as a practical route to knowledge discovery in very large academic corpora.

📝 Abstract
Large text data sets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. The latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, while the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
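The two feature types named in the abstract can be illustrated with a toy example. The sketch below is not the authors' embedding method, which the abstract does not specify; it only shows one simple way to look at both features together, by scoring each citation edge with the cosine similarity of the two texts' embeddings. The document IDs, the 3-dimensional vectors, and the function names are all hypothetical stand-ins; a real LLM embedding model would produce much higher-dimensional vectors.

```python
import math

# Feature type (1): text semantics, represented here by hypothetical
# 3-dimensional embedding vectors (made-up values, not model output).
embeddings = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.9, 0.1, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": [0.0, 0.9, 0.1],
}

# Feature type (2): relationships between texts, represented here as a
# directed citation list (citing document -> cited documents).
citations = {"A": ["B"], "B": [], "C": ["D", "A"], "D": []}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def citation_similarities(embeddings, citations):
    """Map each (citing, cited) edge to the similarity of its endpoints."""
    return {
        (src, dst): cosine(embeddings[src], embeddings[dst])
        for src, targets in citations.items()
        for dst in targets
    }


edge_sims = citation_similarities(embeddings, citations)
for (src, dst), sim in sorted(edge_sims.items()):
    print(f"{src} -> {dst}: {sim:.3f}")
```

In this toy data, the edge from C to A connects two texts with orthogonal embeddings, the kind of link that graph structure captures but text similarity alone would miss.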
Problem

Research questions and friction points this paper is trying to address.

text embeddings
graph structure
semantic information
scientific publications
large-scale dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM embeddings
graph-text integration
Web of Science
semantic structure
large-scale scientific data
Tim Kunt
Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
Annika Buchholz
Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
Imene Khebouri
Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
Thorsten Koch
TU Berlin / Zuse Institute Berlin
Mathematics, Linear Programming, Integer Programming
Ida Litzel
Digital Data and Information for Society, Science, and Culture, Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
Thi Huong Vu
Institute of Mathematics, Vietnam Academy of Science and Technology, 10072 Hanoi, Vietnam