Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient evaluation of large language models (LLMs) on scientific text classification tasks. We present the first systematic benchmark of DeepSeek-R1 and GPT-4o on sentence-level scientific relation classification. To this end, we construct a high-quality, cross-disciplinary dataset of cleaned scientific text and propose a prompt-driven evaluation framework specifically designed for scientific relation identification—supporting zero-shot and few-shot classification. Our methodology integrates Web API invocation with structured output parsing to enable multi-dimensional analysis of consistency and robustness. Results show that GPT-4o exhibits superior stability in fine-grained relation recognition, while DeepSeek-R1 demonstrates notable potential in high-term-density contexts; both models are highly sensitive to prompt template design. This study fills a critical gap in the adaptability assessment of open-source scientific LLMs and validates the effectiveness and generalizability of our proposed evaluation paradigm.
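The evaluation framework described above (prompt construction for zero-/few-shot classification, Web API invocation, and structured output parsing) might be sketched as follows. The relation labels, JSON answer schema, and function names are illustrative assumptions for this sketch, not the paper's actual categories or code; the API call itself is left out, since it depends on the provider.

```python
import json

# Hypothetical label set -- the paper's actual relation categories are not listed here.
RELATION_LABELS = ["comparison", "part-whole", "usage", "result"]

def build_prompt(sentence, labels, examples=None):
    """Build a zero-shot prompt; pass (sentence, label) pairs in `examples` for few-shot."""
    lines = [
        "Classify the relation expressed in the scientific sentence below.",
        f"Choose exactly one label from: {', '.join(labels)}.",
        'Answer as JSON: {"label": "<label>"}.',
    ]
    if examples:  # few-shot: prepend labelled demonstrations
        for ex_sentence, ex_label in examples:
            lines.append(f'Sentence: {ex_sentence}\nAnswer: {{"label": "{ex_label}"}}')
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)

def parse_response(raw, labels):
    """Parse the model's structured output; return None if malformed or off-label."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    label = data.get("label") if isinstance(data, dict) else None
    return label if label in labels else None
```

The strict parse-or-reject step matters for the consistency analysis: a response that is not valid JSON, or that names a label outside the predefined set, is counted as a failure rather than silently coerced.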

📝 Abstract
This study examines how large language models categorize sentences from scientific papers using prompt engineering. We use two advanced web-based models, GPT-4o (OpenAI) and DeepSeek R1, to classify sentences into predefined relationship categories. Although DeepSeek R1 has been tested on benchmark datasets in its technical report, its performance on scientific text categorization remains unexplored. To address this gap, we introduce an evaluation method designed specifically for this task and compile a dataset of cleaned scientific papers from diverse domains, which serves as a common platform for analyzing the two models' effectiveness and consistency in categorization.
Problem

Research questions and friction points this paper is trying to address.

Evaluate GPT-4o and DeepSeek R1 for scientific text categorization.
Develop a new evaluation method for scientific text classification.
Compare model performance using a diverse scientific dataset.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes GPT-4o and DeepSeek R1 for text categorization
Introduces new evaluation method for scientific texts
Compiles diverse domain dataset for model comparison
Aniruddha Maiti
West Virginia State University
Artificial Intelligence · Deep Learning · NLP · Data Science · AI & Data Science in Medical Domain

Samuel Adewumi
Department of Mathematics, Engineering, and Computer Science, West Virginia State University, Institute, WV, USA

Temesgen Alemayehu Tikure
Department of Mathematics, Engineering, and Computer Science, West Virginia State University, Institute, WV, USA

Zichun Wang
Student, West Virginia State University, U.S.A.
AI · Machine Learning · LLM

Niladri Sengupta
Fractal Analytics Inc, USA

Anastasiia Sukhanova
Department of Computer Sciences and Electrical Engineering, Marshall University, Huntington, WV, USA

Ananya Jana
Assistant Professor, Marshall University
Deep Learning · Artificial Intelligence · Biomedical Imaging · Computer Vision · Machine Learning