Can LLMs Predict Academic Collaboration? Topology Heuristics vs. LLM-Based Link Prediction on Real Co-authorship Networks

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of predicting future research collaborations in the absence of explicit graph-structured information. Leveraging author metadata—such as research topics and institutional affiliations—the authors employ the Qwen2.5-72B-Instruct large language model for link prediction on the large-scale co-authorship network from OpenAlex. The approach is benchmarked against topological heuristics (e.g., Common Neighbors, Adamic-Adar) and node2vec embeddings. Experimental results demonstrate that the large language model effectively captures semantic signals like research concepts, achieving AUROC scores of 0.714–0.789 and a peak recall of 92.9% under natural class imbalance. Notably, in cold-start scenarios where collaborators share no common neighbors, the model attains an AUROC of 0.652, significantly outperforming conventional methods and highlighting its complementary value to topology-based approaches.
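The summary above describes querying an LLM with author metadata only, deliberately withholding graph structure. A minimal sketch of how such a query might be framed (this is an illustration, not the paper's actual prompt; the profile fields and wording are assumptions):

```python
# Hedged sketch: framing profile-only link prediction as an LLM query.
# The prompt wording and profile schema here are hypothetical.
def build_prompt(author_a: dict, author_b: dict) -> str:
    """Build a yes/no collaboration question from two author profiles,
    exposing only metadata (research concepts, affiliation) and
    no graph structure, mirroring the paper's no-topology setting."""
    return (
        "Given only the two author profiles below (no graph structure), "
        "will these researchers co-author a paper? Answer yes or no.\n\n"
        f"Author 1: concepts={author_a['concepts']}; "
        f"affiliation={author_a['affiliation']}\n"
        f"Author 2: concepts={author_b['concepts']}; "
        f"affiliation={author_b['affiliation']}\n"
    )

# Toy profiles (invented for illustration).
a = {"concepts": ["graph mining", "LLMs"], "affiliation": "Indiana University"}
b = {"concepts": ["network science"], "affiliation": "University of Virginia"}
prompt = build_prompt(a, b)
print(prompt)
```

Note the paper's finding that adding pre-computed graph features to such a prompt degraded performance, so the metadata-only framing is the point, not a limitation.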
📝 Abstract
Can large language models (LLMs) predict which researchers will collaborate? We study this question through link prediction on real-world co-authorship networks from OpenAlex (9.96M authors, 108.7M edges), evaluating whether LLMs can predict future scientific collaborations using only author profiles, without access to graph structure. Using Qwen2.5-72B-Instruct across three historical eras of AI research, we find that LLMs and topology heuristics capture distinct signals and are strongest in complementary settings. On new-edge prediction under natural class imbalance, the LLM achieves AUROC 0.714–0.789, outperforming Common Neighbors, Jaccard, and Preferential Attachment, with recall up to 92.9%; under balanced evaluation, the LLM outperforms all topology heuristics in every era (AUROC 0.601–0.658 vs. best-heuristic 0.525–0.538); on continued edges, the LLM (0.687) is competitive with Adamic-Adar (0.684). Critically, 78.6–82.7% of new collaborations occur between authors with no common neighbor, a blind spot where all topology heuristics score zero but the LLM still achieves AUROC 0.652 by reasoning from author metadata alone. A temporal metadata ablation reveals that research concepts are the dominant signal (removing concepts drops AUROC by 0.047–0.084). Providing pre-computed graph features to the LLM degrades performance due to anchoring effects, confirming that LLMs and topology methods should operate as separate, complementary channels. A socio-cultural ablation finds that name-inferred ethnicity and institutional country do not predict collaboration beyond topology, reflecting the demographic homogeneity of AI research. A node2vec baseline achieves AUROC comparable to Adamic-Adar, establishing that LLMs access a fundamentally different information channel (author metadata) rather than encoding the same structural signal differently.
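The topology heuristics the paper benchmarks against are standard and easy to state. A self-contained sketch of the four scores on a toy co-authorship graph (adjacency sets are invented for illustration; these are textbook definitions, not code from the paper):

```python
# Hedged sketch: the four topology heuristics used as baselines,
# computed on a toy undirected co-authorship graph.
import math

# Toy graph: author -> set of co-authors (invented example).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def common_neighbors(g, u, v):
    # Number of shared co-authors.
    return len(g[u] & g[v])

def jaccard(g, u, v):
    # Shared co-authors normalized by the size of the combined neighborhood.
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

def adamic_adar(g, u, v):
    # Shared co-authors weighted by the inverse log of their degree,
    # so rare shared collaborators count more.
    return sum(1.0 / math.log(len(g[w])) for w in g[u] & g[v] if len(g[w]) > 1)

def preferential_attachment(g, u, v):
    # Product of degrees: prolific authors are more likely to link.
    return len(g[u]) * len(g[v])

# The paper's blind-spot case: if two authors share no co-author,
# common_neighbors, jaccard, and adamic_adar are all zero.
print(common_neighbors(graph, "A", "D"))  # 1 (the shared co-author B)
```

This makes the paper's key structural point concrete: for the 78.6–82.7% of new collaborations with no common neighbor, the first three scores are identically zero, so only signals outside the graph (here, author metadata via the LLM) can rank those pairs.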
Problem

Research questions and friction points this paper is trying to address.

link prediction
academic collaboration
large language models
co-authorship networks
topology heuristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based link prediction
co-authorship networks
topology heuristics
author metadata
research collaboration prediction
Fan Huang
Indiana University Bloomington
NLP
Munjung Kim
School of Data Science, University of Virginia