SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

📅 2025-08-28
🤖 AI Summary
Existing topic discovery methods in scientific literature rely on word embeddings, limiting their capacity to model high-dimensional semantic relationships and deep contextual dependencies. To address this, we propose an LLM-enhanced end-to-end topic discovery framework. First, a large language model generates high-quality semantic triplets from scientific texts; an entropy-driven hard-negative sampling strategy is then employed to construct contrastive learning objectives. Subsequently, the text encoder is jointly optimized via triplet loss and contrastive loss, enhancing both topic discriminability and contextual sensitivity. Evaluated on three real-world scholarly datasets, our method consistently outperforms state-of-the-art approaches, achieving an average 5.2% improvement in topic clustering accuracy. Moreover, it enables fine-grained, interpretable analysis of thematic evolution over time. This work establishes a novel paradigm for scientific intelligence mining by integrating generative and discriminative capabilities of LLMs into unsupervised topic modeling.

📝 Abstract
Topic discovery in scientific literature helps researchers identify emerging trends and explore new avenues for investigation, and it makes scientific information retrieval easier. Many machine learning methods, particularly deep embedding techniques, have been applied to discover research topics. However, most existing topic discovery methods rely on word embeddings to capture semantics; they lack a comprehensive understanding of scientific publications and struggle with complex, high-dimensional text relationships. Inspired by the exceptional comprehension of textual information by large language models (LLMs), we propose SciTopic, a topic discovery method enhanced by LLMs to improve scientific topic identification. Specifically, we first build a textual encoder to capture the content of scientific publications, including metadata, title, and abstract. Next, we construct a space optimization module that integrates entropy-based sampling and LLM-guided triplet tasks, sharpening the focus on thematic relevance and contextual intricacies between ambiguous instances. Then, we fine-tune the textual encoder under the LLMs' guidance by optimizing a contrastive loss over the triplets, forcing the encoder to better discriminate instances of different topics. Finally, extensive experiments on three real-world datasets of scientific publications demonstrate that SciTopic outperforms state-of-the-art (SOTA) scientific topic discovery methods, enabling researchers to gain deeper and faster insights.
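The entropy-based sampling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each publication already has a soft assignment over topics (for example, derived from distances to cluster centroids) and that the most ambiguous instances are the ones with the highest Shannon entropy. The function names `shannon_entropy` and `select_ambiguous`, and the toy probabilities, are hypothetical and not taken from the paper; the actual selection rule in SciTopic may differ.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of one probability distribution over topics."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_ambiguous(assignments, k):
    """Return indices of the k instances whose topic assignment is most uncertain.

    Ties are broken by the lower index so the result is deterministic.
    """
    scored = [(shannon_entropy(p), i) for i, p in enumerate(assignments)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]

# Hypothetical soft topic assignments for four publications over three topics.
assignments = [
    [0.98, 0.01, 0.01],  # confidently assigned to topic 0
    [0.34, 0.33, 0.33],  # nearly uniform: highly ambiguous
    [0.10, 0.85, 0.05],  # fairly confident
    [0.50, 0.49, 0.01],  # torn between two topics
]

# Assumption: these are the instances that would be passed on to the LLM-guided triplet task.
ambiguous = select_ambiguous(assignments, k=2)
print(ambiguous)
```

Restricting LLM queries to high-entropy instances is one plausible way to spend a limited query budget on the cases where the encoder is least certain.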
Problem

Research questions and friction points this paper is trying to address.

Improving topic discovery in scientific literature using LLMs
Addressing limitations of word embedding in semantic understanding
Enhancing discrimination of complex text relationships in publications
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-enhanced textual encoder captures scientific content
Entropy sampling and triplet tasks optimize thematic space
Contrastive loss fine-tuning improves topic discrimination
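As a rough illustration of the triplet-based fine-tuning objective, the sketch below uses a standard margin-based triplet loss with Euclidean distance. This is one common formulation and is swapped in here as an assumption; the summary does not state the exact loss used by SciTopic. The helper `order_triplet` and its `a_is_closer` flag are hypothetical stand-ins for an LLM's judgment about which candidate is topically closer to the anchor.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def order_triplet(anchor, cand_a, cand_b, a_is_closer):
    """Arrange (anchor, positive, negative) from a judge's answer.

    `a_is_closer` is a placeholder for the LLM's decision (an assumption of this sketch).
    """
    return (anchor, cand_a, cand_b) if a_is_closer else (anchor, cand_b, cand_a)

# Toy 2-D embeddings; real encoder outputs would be high-dimensional.
anchor = [0.0, 0.0]
cand_a = [3.0, 4.0]  # distance 5 from the anchor
cand_b = [0.6, 0.8]  # distance 1 from the anchor

# Suppose the judge says cand_b is topically closer to the anchor.
a, p, n = order_triplet(anchor, cand_a, cand_b, a_is_closer=False)

satisfied = triplet_margin_loss(a, p, n)  # embedding agrees with the judge: no penalty
violated = triplet_margin_loss(a, n, p)   # roles swapped: large penalty
print(satisfied, violated)
```

During fine-tuning, a loss of this kind would be minimized over many triplets so that the encoder pulls same-topic instances together and pushes different-topic instances apart.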
Pengjiang Li
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Zaitian Wang
Computer Network Information Center, Chinese Academy of Sciences
Data-centric AI, Large Language Models
Xinhao Zhang
PhD student, Portland State University
Data Mining, Reinforcement Learning
Ran Zhang
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Lu Jiang
Research Scientist @ Apple
Generative AI, Foundation Model, Robust Deep Learning, Multimedia, Video Generation
Pengfei Wang
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Yuanchun Zhou
Computer Network Information Center, CAS
Data Mining, Big Data Analysis