🤖 AI Summary
This work addresses the inefficiency of manual scientific definition extraction amid the rapid growth of academic literature by proposing the first end-to-end large language model (LLM) framework tailored for this task. The framework combines multi-step prompt engineering with DSPy optimization strategies to automatically extract definitions from scholarly texts. To support training and evaluation, the authors introduce two human-annotated datasets: DefExtra for definition extraction and DefSim for definition similarity. Experiments across 16 language models show that the system recovers 86.4% of the definitions in the test set, though models tend to over-generate, so future work should focus on identifying which extracted definitions are actually relevant. The study also validates that natural language inference (NLI)-based metrics assess the quality of extracted definitions more reliably than alternative metrics. Both the code and datasets are publicly released to foster further research in this area.
📝 Abstract
Definitions are the foundation of any scientific work, but with the significant increase in publication numbers, gathering definitions relevant to a given keyword has become challenging. We therefore introduce SciDef, an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra and DefSim, novel datasets of human-extracted definitions and of definition pairs' similarity, respectively. Evaluating 16 language models across prompting strategies, we demonstrate that multi-step and DSPy-optimized prompting improve extraction performance. To evaluate extraction, we test various metrics and show that an NLI-based method yields the most reliable results. We show that LLMs are largely able to extract definitions from scientific literature (86.4% of definitions from our test set); yet future work should focus not just on finding definitions, but on identifying relevant ones, as models tend to over-generate them. Code and datasets are available at https://github.com/Media-Bias-Group/SciDef.
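The abstract does not spell out how an NLI-based similarity metric scores an extracted definition against a human-annotated one. A minimal sketch of one plausible scheme is shown below: score entailment in both directions and average, so the metric is symmetric. Everything here is an assumption for illustration, not the paper's actual metric; in particular, `toy_entail_prob` is a crude lexical-overlap stand-in where a real setup would use a trained NLI model's entailment probability.

```python
# Hedged sketch of a bidirectional NLI-style similarity score for
# definition pairs. `toy_entail_prob` is a hypothetical stand-in for a
# real NLI model; SciDef's actual metric may differ.

def toy_entail_prob(premise: str, hypothesis: str) -> float:
    # Crude proxy for P(entailment): fraction of hypothesis tokens
    # that also appear in the premise.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

def nli_similarity(def_a: str, def_b: str, entail_prob=toy_entail_prob) -> float:
    # Average both entailment directions: a good extracted definition
    # should entail, and be entailed by, the reference definition.
    return 0.5 * (entail_prob(def_a, def_b) + entail_prob(def_b, def_a))

gold = "a language model predicts the next token in a sequence"
pred = "a language model is a system that predicts the next token"
print(round(nli_similarity(gold, pred), 3))
```

Swapping `toy_entail_prob` for a real NLI classifier's entailment probability keeps the same interface while making the score semantic rather than lexical.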