Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
The rapid growth of astronomical literature and the high cost of manually annotating key scientific entities (e.g., telescopes, instruments, semantic attributes) hinder scalable knowledge extraction. Method: We propose a SciBERT-based encoder that jointly learns three entity recognition tasks through multi-task learning. To improve efficiency, we introduce random segment sampling for lightweight fine-tuning; at inference, robustness is enhanced by majority voting over segmented predictions. Contribution/Results: The approach significantly outperforms open-weight GPT baselines across multiple astronomical knowledge extraction tasks, achieving an average +4.2% F1 gain while reducing both training and inference overhead. Core contributions: (i) a lightweight, domain-adapted multi-task architecture; (ii) empirical validation of structured fine-tuning and ensemble inference strategies for specialized scientific text; and (iii) a cost-effective technical pathway for constructing astronomical knowledge graphs.
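The two efficiency tricks in the summary (random segment sampling during fine-tuning, majority voting over segment predictions at inference) can be sketched in plain Python. The segment length, sample count, and function names below are illustrative assumptions; the paper's exact sampling scheme is not specified here.

```python
import random
from collections import Counter

def sample_segments(tokens, seg_len=128, n_samples=4, rng=None):
    """Randomly sample fixed-length token segments from a long document
    (illustrative stand-in for the paper's stochastic segment sampling;
    seg_len and n_samples are assumed values, not the paper's settings)."""
    rng = rng or random.Random(0)
    if len(tokens) <= seg_len:
        return [tokens]  # document already fits in one segment
    starts = [rng.randrange(len(tokens) - seg_len + 1) for _ in range(n_samples)]
    return [tokens[s:s + seg_len] for s in starts]

def majority_vote(segment_labels):
    """Aggregate per-segment label predictions into a single label."""
    return Counter(segment_labels).most_common(1)[0][0]
```

During fine-tuning, only the sampled segments (rather than whole documents) would be fed to the encoder, which keeps sequence lengths and training cost bounded.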

📝 Abstract
Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions from textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for astronomy corpora classification. To carry out the fine-tuning, we stochastically sample segments from the training data and use majority voting over the test segments at inference time. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
Problem

Research questions and friction points this paper is trying to address.

Automating key entity extraction from astronomy literature
Classifying telescope references and instrument mentions
Developing multi-task transformer models for astronomy corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned SciBERT model for astronomy text
Stochastic sampling from training data segments
Majority voting over test segments for inference
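The inference-time voting in the bullets above can be sketched as a small helper that chunks a test document into contiguous segments, classifies each, and returns the majority label. The non-overlapping chunking and the `classify_segment` callback are assumptions for illustration; the paper's segmentation details may differ.

```python
from collections import Counter

def predict_document(tokens, classify_segment, seg_len=128):
    """Split a document into contiguous segments, classify each one with
    a caller-supplied per-segment classifier, and majority-vote the result
    (hypothetical helper sketching the paper's ensemble inference)."""
    segments = [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]
    votes = [classify_segment(seg) for seg in segments]
    return Counter(votes).most_common(1)[0][0]
```

In practice `classify_segment` would wrap the fine-tuned SciBERT model; here any function mapping a segment to a label works, which makes the aggregation logic easy to test in isolation.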