Predicting New Concept-Object Associations in Astronomy by Mining the Literature

📅 2026-02-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work tests whether concept-object associations that have not yet appeared in the astronomical literature can be predicted from the literature itself, with the aim of informing the prioritization of telescope observation targets. The authors build a knowledge graph from astro-ph publications by extracting object names from OCR-processed papers, resolving them to SIMBAD identifiers, and linking them to clustered scientific concepts. Because the clustered concepts overlap semantically, a concept-similarity smoothing step is applied at inference time to every method compared. With smoothing, an implicit-feedback matrix factorization model trained with Alternating Least Squares (ALS) outperforms the strongest baseline, a KNN model using text-embedding concept similarity, by 16.8% in NDCG@100 and 19.8% in Recall@100, and exceeds the best recency heuristic by 96% and 88%, respectively. These findings suggest that the historical literature encodes predictive structure that global heuristics and local neighborhood voting do not capture.
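For readers unfamiliar with the two reported metrics, the following is a minimal sketch of Recall@k and NDCG@k with binary relevance for a single concept. The function names, the binary-gain choice, and the per-concept framing are assumptions made for illustration; the summary does not describe the paper's exact evaluation protocol.

```python
import numpy as np


def recall_at_k(scores, relevant, k=100):
    """Fraction of the relevant objects that appear in the top-k ranking.

    scores   : 1-D array, one predicted score per candidate object
    relevant : set of object indices that are true new associations
    """
    if not relevant:
        return 0.0
    top = np.argsort(-np.asarray(scores, dtype=float))[:k]
    hits = sum(1 for i in top if int(i) in relevant)
    return hits / len(relevant)


def ndcg_at_k(scores, relevant, k=100):
    """Normalized discounted cumulative gain with binary gains (assumed)."""
    top = np.argsort(-np.asarray(scores, dtype=float))[:k]
    gains = np.array([1.0 if int(i) in relevant else 0.0 for i in top])
    # Position p (1-based) is discounted by 1 / log2(p + 1).
    discounts = 1.0 / np.log2(np.arange(2, len(top) + 2))
    dcg = float(gains @ discounts)
    # Ideal ranking places every relevant object first.
    ideal_hits = min(len(relevant), len(top))
    idcg = float(discounts[:ideal_hits].sum())
    return dcg / idcg if idcg > 0 else 0.0
```

In a setup like the one described, such per-concept values would typically be averaged over all evaluated concepts at each temporal cutoff, though that aggregation is also an assumption here.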

📝 Abstract

We construct a concept-object knowledge graph from the full astro-ph corpus through July 2025. Using an automated pipeline, we extract named astrophysical objects from OCR-processed papers, resolve them to SIMBAD identifiers, and link them to scientific concepts annotated in the source corpus. We then test whether historical graph structure can forecast new concept-object associations before they appear in print. Because the concepts are derived from clustering and therefore overlap semantically, we apply an inference-time concept-similarity smoothing step uniformly to all methods. Across four temporal cutoffs on a physically meaningful subset of concepts, an implicit-feedback matrix factorization model (alternating least squares, ALS) with smoothing outperforms the strongest neighborhood baseline (KNN using text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs 0.123) and 19.8% on Recall@100 (0.175 vs 0.146), and exceeds the best recency heuristic by 96% and 88%, respectively. These results indicate that historical literature encodes predictive structure not captured by global heuristics or local neighborhood voting, suggesting a path toward tools that could help triage follow-up targets for scarce telescope time.
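The modeling steps named in the abstract can be sketched roughly as follows: a small NumPy version of implicit-feedback ALS (in the standard confidence-weighted formulation) followed by an inference-time smoothing of scores across similar concepts. This is an illustrative sketch, not the authors' implementation. The function names, the hyperparameters (`factors`, `alpha`, `reg`, `beta`), and in particular the linear-blend form of the smoothing are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def implicit_als(R, factors=8, alpha=10.0, reg=0.1, iters=15, seed=0):
    """Implicit-feedback matrix factorization via alternating least squares.

    R : (n_concepts, n_objects) non-negative association counts.
    Returns concept factors X and object factors Y so that X @ Y.T scores pairs.
    """
    R = np.asarray(R, dtype=float)
    rng = np.random.default_rng(seed)
    P = (R > 0).astype(float)   # binary preference
    C = 1.0 + alpha * R         # confidence weights (assumed linear form)
    X = 0.01 * rng.standard_normal((R.shape[0], factors))
    Y = 0.01 * rng.standard_normal((R.shape[1], factors))
    ridge = reg * np.eye(factors)

    def solve_side(fixed, conf, pref):
        # For each row i, solve (F^T C_i F + reg*I) x = F^T C_i p_i.
        out = np.empty((conf.shape[0], fixed.shape[1]))
        for i in range(conf.shape[0]):
            weighted = conf[i][:, None] * fixed
            A = fixed.T @ weighted + ridge
            b = weighted.T @ pref[i]
            out[i] = np.linalg.solve(A, b)
        return out

    for _ in range(iters):
        X = solve_side(Y, C, P)       # update concepts with objects fixed
        Y = solve_side(X, C.T, P.T)   # update objects with concepts fixed
    return X, Y


def smooth_scores(S, W, beta=0.3):
    """Blend each concept's scores with those of similar concepts.

    S : (n_concepts, n_objects) raw scores from any method
    W : (n_concepts, n_concepts) non-negative concept similarity
    The blend (1 - beta) * S + beta * W_norm @ S is an assumed form.
    """
    S = np.asarray(S, dtype=float)
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    W_norm = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    return (1.0 - beta) * S + beta * (W_norm @ S)
```

Because `smooth_scores` only needs a score matrix, the same step could be applied to a KNN or heuristic baseline as well, which matches the abstract's statement that smoothing is applied uniformly to all methods.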

Problem

Research questions and friction points this paper is trying to address.

concept-object association

astronomy

literature mining

knowledge graph

prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge graph

matrix factorization

concept-object association