Scalable Scientific Interest Profiling Using Large Language Models

📅 2025-08-18
🤖 AI Summary
To address the problem of outdated research interest profiles for scholars, this paper develops and evaluates two large language model (LLM) methods for automated profile generation: one based on Medical Subject Headings (MeSH) terms and one that summarizes PubMed abstracts. Using GPT-4o-mini, the authors generate profiles for faculty at Columbia University Irving Medical Center and compare them with the researchers' self-written profiles. In blinded human review, 77.78% of MeSH-based profiles were rated good or excellent, readability was favored in 93.44% of cases, and panelists preferred the MeSH-based profile over the abstract-based one in 67.86% of comparisons. Semantic similarity to self-written profiles was moderate and nearly identical for the two methods (BERTScore F1: 0.542 for MeSH-based, 0.555 for abstract-based), so the advantage of MeSH-based profiles lies in readability and reviewer preference rather than in automatic similarity scores.

📝 Abstract
Research profiles help surface scientists' expertise but are often outdated. We develop and evaluate two large language model-based methods to generate scientific interest profiles: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compare them with researchers' self-written profiles. We assembled titles, MeSH terms, and abstracts for 595 faculty at Columbia University Irving Medical Center; self-authored profiles were available for 167. Using GPT-4o-mini, we generated profiles and assessed them with automatic metrics and blinded human review. Lexical overlap with self-written profiles was low (ROUGE-L, BLEU, METEOR), while BERTScore indicated moderate semantic similarity (F1: 0.542 for MeSH-based; 0.555 for abstract-based). Paraphrased references yielded 0.851, highlighting metric sensitivity. TF-IDF Kullback-Leibler divergence (8.56 for MeSH-based; 8.58 for abstract-based) suggested distinct keyword choices. In manual review, 77.78 percent of MeSH-based profiles were rated good or excellent, readability was favored in 93.44 percent of cases, and panelists preferred MeSH-based over abstract-based profiles in 67.86 percent of comparisons. Overall, large language models can generate researcher profiles at scale; MeSH-derived profiles tend to be more readable than abstract-derived ones. Machine-generated and self-written profiles differ conceptually, with human summaries introducing more novel ideas.
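
The gap between low lexical overlap and moderate semantic similarity is easier to see with the metrics written out. The following is a minimal sketch, not the authors' evaluation code: it implements ROUGE-L F1 from the longest common subsequence and a KL divergence between smoothed TF-IDF distributions. The whitespace tokenizer, the smoothed IDF formula, the `eps` smoothing constant, and all function names are assumptions made for illustration; the abstract does not specify how the paper computed either metric.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption; real pipelines do more)."""
    return text.lower().split()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def tfidf_distribution(doc_tokens, corpus, vocab, eps=1e-9):
    """TF-IDF weights over a shared vocabulary, smoothed and normalized to sum to 1.

    `doc_tokens` must be non-empty; `corpus` is a list of token collections.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = []
    for term in vocab:
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumed variant)
        weights.append(tf[term] / len(doc_tokens) * idf + eps)
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Two made-up profiles that share a topic but order and phrase it differently.
    self_written = "I study clinical trial recruitment and informatics"
    generated = "Research focuses on informatics for clinical trial recruitment"
    docs = [tokenize(self_written), tokenize(generated)]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    p = tfidf_distribution(docs[0], docs, vocab)
    q = tfidf_distribution(docs[1], docs, vocab)
    print("ROUGE-L F1:", rouge_l_f1(generated, self_written))
    print("KL(self || generated):", kl_divergence(p, q))
```

Because ROUGE-L rewards only tokens that appear in the same order, and KL divergence grows whenever one text uses keywords the other lacks, both can signal a large difference between two profiles that describe the same research area in different words. That is consistent with the abstract's observation that a paraphrased reference still scored only 0.851 on BERTScore.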
Problem

Research questions and friction points this paper is trying to address.

Generating scientific interest profiles using large language models
Comparing MeSH-based and abstract-based profile generation methods
Evaluating machine-generated profiles against self-written researcher profiles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs to generate scientific interest profiles
Comparing MeSH-based and abstract-based profile generation
Evaluating profiles with automatic metrics and human review
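
To make the MeSH-based approach concrete, here is a hypothetical sketch of how a researcher's MeSH terms might be aggregated into a prompt for a model such as GPT-4o-mini. This page does not reproduce the paper's prompt or preprocessing, so the ranking rule, the `top_k` cutoff, the prompt wording, the `mesh_terms` field, and both function names are illustrative assumptions. No model call is made.

```python
from collections import Counter


def top_mesh_terms(publications, top_k=20):
    """Rank MeSH terms by how many of the researcher's papers carry them."""
    counts = Counter()
    for pub in publications:
        counts.update(set(pub["mesh_terms"]))  # count each term once per paper
    # Sort by frequency (descending), then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def build_mesh_prompt(name, publications, top_k=20):
    """Assemble a plain-text prompt from the most frequent MeSH terms (hypothetical wording)."""
    ranked = top_mesh_terms(publications, top_k)
    term_lines = "\n".join(f"- {term} (papers: {n})" for term, n in ranked)
    return (
        f"Write a short research interest profile for {name}.\n"
        "Base it only on these MeSH terms from their PubMed-indexed papers:\n"
        f"{term_lines}\n"
        "Use plain, readable prose and do not invent topics."
    )


if __name__ == "__main__":
    # Made-up publication records for illustration only.
    pubs = [
        {"mesh_terms": ["Humans", "Clinical Trials as Topic"]},
        {"mesh_terms": ["Humans", "Natural Language Processing"]},
    ]
    print(build_mesh_prompt("Dr. Example", pubs, top_k=3))
```

A term list like this is far shorter than a set of full abstracts, which is one plausible reason a MeSH-driven prompt could yield more compact, readable profiles. The abstract reports the readability result but does not state the cause.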
Yilun Liang
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Gongbo Zhang
School of Electronic and Computer Engineering, Peking University
AI for Science, Machine Learning, Generative Model
Edward Sun
University of California, Los Angeles
AI for Science, Agents, Robotics
Betina Idnay
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Yilu Fang
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Fangyi Chen
Research Scientist, ByteDance
Deep Learning, Multimodal LLM, Object Detection
Casey Ta
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Yifan Peng
Henry Samueli School of Engineering and Applied Science, University of California, Los Angeles, CA, USA
Chunhua Weng
Professor, Columbia University
Biomedical Informatics, Clinical Research Informatics