🤖 AI Summary
The rapid growth of astronomical literature and the high cost of manual annotation of key scientific entities (e.g., telescopes, instruments, semantic attributes) hinder scalable knowledge extraction.
Method: We propose a SciBERT-based encoder that jointly learns three entity recognition tasks via multi-task learning. To improve training efficiency, we introduce random segment sampling for lightweight fine-tuning; at inference time, robustness is enhanced through majority voting over per-segment predictions.
Contribution/Results: Our approach significantly outperforms open-weight GPT baselines across multiple astronomical text knowledge extraction tasks, achieving an average +4.2% F1 gain while reducing both training and inference overhead. The core contributions include: (i) a lightweight, domain-adapted multi-task architecture; (ii) empirical validation of structured fine-tuning and ensemble inference strategies for specialized scientific text; and (iii) a cost-effective technical pathway for constructing astronomical knowledge graphs.
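The random segment sampling mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the segment length, segment count, and function names are hypothetical hyperparameters chosen for the example.

```python
import random

def sample_segments(tokens, seg_len=128, n_segments=4, seed=0):
    """Sample fixed-length token windows from a document.

    A sketch of random segment sampling: instead of encoding the full
    article, each fine-tuning step sees a few randomly chosen windows.
    seg_len and n_segments are illustrative values, not those from the
    paper.
    """
    rng = random.Random(seed)
    if len(tokens) <= seg_len:
        return [tokens[:]]  # short documents are used whole
    max_start = len(tokens) - seg_len
    starts = [rng.randint(0, max_start) for _ in range(n_segments)]
    return [tokens[s:s + seg_len] for s in starts]
```

Because each step only encodes short windows rather than whole articles, fine-tuning cost stays roughly constant regardless of document length.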
📝 Abstract
Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions in textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for classification on astronomy corpora. For fine-tuning, we stochastically sample segments from the training data; at inference time, we apply majority voting over predictions on test segments. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
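The inference-time majority voting described above can be sketched as a simple aggregation step. This is an illustrative sketch, assuming each test segment has already been classified independently; the tie-breaking rule (first-seen label wins) is an assumption of this example, not stated in the paper.

```python
from collections import Counter

def majority_vote(segment_labels):
    """Aggregate per-segment predictions into one document-level label.

    Each segment of a test article is classified independently; the
    most frequent label becomes the document's prediction. Ties are
    broken by first occurrence (an illustrative choice).
    """
    counts = Counter(segment_labels)
    top = max(counts.values())
    for label in segment_labels:  # first-seen tie-break
        if counts[label] == top:
            return label
```

For example, if three segments of an article are labeled `["HST", "HST", "JWST"]`, the document-level prediction is `"HST"`.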