🤖 AI Summary
Problem: The rapid growth of scientific literature makes it difficult to integrate knowledge across disciplines. Method: This paper proposes a lightweight, LLM-driven approach to structured knowledge construction that combines large language models with a reusable ontology of scientific concepts, rather than relying on costly retrieval-augmented generation or opaque semantic modeling. The ontology schema is designed to apply across disciplines while remaining interpretable and extensible, and concept extraction generalizes across domains using only 20 annotated abstracts. The method is used to build a knowledge graph spanning astrophysics, fluid dynamics, and evolutionary biology, scaled to 30,000 arXiv papers. Contribution/Results: The resulting system supports precise question answering over the literature and analysis of scientific trends. All components, including the ontology, annotation guidelines, and graph construction pipeline, are open-sourced to support reproducible, transparent scholarly analysis.
📝 Abstract
The exponential growth of the scientific literature makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval-augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost-prohibitive. Structured representations offer a natural complement, enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we explore extracting structured representations using LLMs. By combining LLMs' semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields, and extracting its concepts requires only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on Hugging Face Spaces. Code: https://github.com/chiral-carbon/kg-for-science.
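To illustrate why structured representations allow corpus-wide questions, the following is a minimal sketch rather than the authors' implementation. It assumes each extracted concept is stored as a record with a paper identifier, year, field, schema category, and concept string. The `ConceptMention` type, the category names, and all sample records are hypothetical and invented for illustration; the actual schema is defined in the linked repository.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptMention:
    """One extracted concept. All field names here are hypothetical."""
    paper_id: str
    year: int
    field: str
    category: str  # assumed schema category, e.g. "method" or "object"
    concept: str


# Invented records standing in for LLM-extracted output.
MENTIONS = [
    ConceptMention("p1", 2021, "astrophysics", "method", "neural network"),
    ConceptMention("p2", 2022, "fluid dynamics", "method", "neural network"),
    ConceptMention("p3", 2022, "evolutionary biology", "method", "neural network"),
    ConceptMention("p3", 2022, "evolutionary biology", "object", "phylogenetic tree"),
    ConceptMention("p4", 2022, "astrophysics", "object", "neutron star"),
]


def papers_per_year(mentions, category, concept):
    """Count distinct papers per year that mention (category, concept)."""
    seen = {
        (m.paper_id, m.year)
        for m in mentions
        if m.category == category and m.concept == concept
    }
    counts = Counter(year for _, year in seen)
    return dict(sorted(counts.items()))


def fields_using(mentions, category, concept):
    """Return the sorted set of fields in which (category, concept) appears."""
    return sorted(
        {m.field for m in mentions if m.category == category and m.concept == concept}
    )


trend = papers_per_year(MENTIONS, "method", "neural network")
fields = fields_using(MENTIONS, "method", "neural network")
```

Because every record is already structured, a trend question such as "how many papers per year use a given method" becomes a single pass over the table, with no per-query LLM call over the raw text.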