🤖 AI Summary
This study addresses the challenge of effectively integrating textual semantics and citation network structure to uncover the intrinsic organizational principles of large-scale scholarly publications. Leveraging the Web of Science dataset of ~56 million papers, the work presents the first systematic integration of large language model (LLM)-derived text embeddings with citation graph topology to construct a unified semantic–structural knowledge graph. By establishing a coherent representation of this multimodal, heterogeneous data, the approach reveals natural disciplinary clusters and cross-domain linkage patterns, elucidating a self-organizing knowledge landscape shaped jointly by semantic content and citation relationships. This framework establishes a novel paradigm for knowledge discovery in ultra-large-scale academic corpora.
📝 Abstract
Large text data sets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself, whose information is conveyed through semantics, and (2) its relationships to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. Demonstrating these possibilities and their practicality, we investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
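The two feature types described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the bag-of-words counter below merely stands in for an LLM embedding model, and the mini-corpus, citation edges, and the `combined_score` blending function are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' standing in for an LLM encoder.

    Assumption: in practice this would be a dense vector from a
    transformer embedding model, not token counts.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical mini-corpus: paper id -> text, plus directed citation edges.
papers = {
    "p1": "graph neural networks for citation analysis",
    "p2": "language model embeddings of scientific text",
    "p3": "citation networks and graph analysis of science",
}
citations = {("p1", "p3"), ("p2", "p3")}  # (citing, cited)

emb = {pid: embed(t) for pid, t in papers.items()}


def combined_score(a, b, alpha=0.5):
    """Blend semantic similarity (feature type 1) with a binary
    citation-link signal (feature type 2)."""
    link = 1.0 if (a, b) in citations or (b, a) in citations else 0.0
    return alpha * cosine(emb[a], emb[b]) + (1 - alpha) * link
```

Here "p1" and "p3" score highly on both channels (shared vocabulary and a citation edge), while "p1" and "p2" share neither; a real system would replace both toy signals with LLM embeddings and graph-learning algorithms over the full citation network.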