🤖 AI Summary
How can key concepts and novel contributions in scientific papers be efficiently extracted and structured? This paper proposes an approach that combines knowledge graphs with large language models (LLMs) for automated question-answer (QA) pair generation. First, a fine-tuned, fine-grained entity-relation extraction model constructs paper-level knowledge graphs. Second, a triplet significance scoring mechanism, inspired by TF-IDF, is introduced to prioritize innovative, domain-specific relations. Third, an LLM generates semantically precise QA pairs that cover the paper's main ideas, grounded in the filtered triplets. Experimental evaluation and expert assessment demonstrate that the method significantly outperforms text-only baselines in main-idea capture accuracy and triplet quality, while preserving contextual awareness and interpretability. The framework establishes a new paradigm for rapid scholarly comprehension and intelligent academic literature retrieval.
📝 Abstract
When deciding to read an article or incorporate it into their research, scholars often seek to quickly identify and understand its main ideas. In this paper, we aim to extract these key concepts and contributions from scientific articles in the form of Question and Answer (QA) pairs. We propose two distinct approaches for generating QAs. The first approach involves selecting salient paragraphs, using a Large Language Model (LLM) to generate questions, ranking these questions by the likelihood of obtaining meaningful answers, and subsequently generating answers. This method relies exclusively on the content of the articles. However, assessing an article's novelty typically requires comparison with the existing literature. Therefore, our second approach leverages a Knowledge Graph (KG) for QA generation. We construct the KG by fine-tuning an Entity Relationship (ER) extraction model on scientific articles and applying it to extract triplets from each document. We then employ a salient triplet extraction method to select the most pertinent ERs per article, utilizing metrics such as the centrality of entities based on a triplet TF-IDF-like measure. This measure assesses the saliency of a triplet based on its importance within the article compared to its prevalence in the literature. For evaluation, we generate QAs using both approaches and have them assessed by Subject Matter Experts (SMEs) through a set of predefined metrics to evaluate the quality of both questions and answers. Our evaluations demonstrate that the KG-based approach effectively captures the main ideas discussed in the articles. Furthermore, our findings indicate that fine-tuning the ER extraction model on our scientific corpus is crucial for extracting high-quality triplets from such documents.
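The abstract does not spell out the triplet TF-IDF-like measure. A minimal sketch, under the assumption that it mirrors classic TF-IDF with a triplet's in-article frequency playing the role of term frequency and its prevalence across the corpus playing the role of document frequency; the function name and data layout below are illustrative, not from the paper:

```python
import math
from collections import Counter

def triplet_saliency(article_triplets, corpus_articles):
    """Score each triplet of one article by a TF-IDF-like measure:
    salient triplets are frequent in this article but rare in the corpus."""
    tf = Counter(article_triplets)          # in-article triplet frequency
    n_articles = len(corpus_articles)
    df = Counter()                          # corpus-level document frequency
    for triplets in corpus_articles:
        for t in set(triplets):
            df[t] += 1
    scores = {}
    for t, f in tf.items():
        # smoothed IDF, as in common TF-IDF implementations
        idf = math.log((1 + n_articles) / (1 + df[t])) + 1
        scores[t] = (f / len(article_triplets)) * idf
    return scores

# Toy corpus of two articles, each a list of (head, relation, tail) triplets
corpus = [
    [("model", "fine-tuned-on", "scientific corpus"),
     ("paper", "proposes", "QA generation")],
    [("paper", "proposes", "QA generation"),
     ("KG", "built-from", "triplets")],
]
article = corpus[0]
ranked = sorted(triplet_saliency(article, corpus).items(),
                key=lambda kv: kv[1], reverse=True)
```

In this toy example the triplet unique to the first article outranks the one shared across the corpus, matching the stated goal of surfacing what is distinctive about a paper rather than what is commonplace in the literature.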