PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

📅 2026-05-27
🤖 AI Summary
This study addresses the lack of domain-specific evaluation benchmarks for large language models (LLMs) in petroleum engineering by introducing a standardized benchmark that covers production, reservoir, and drilling engineering and comprises 1,200 questions in multiple-choice, true/false, term-definition, and short-answer formats. The question bank is built through a three-stage process of data preprocessing, quality filtering, and multi-model validation, combined with expert review. Eight mainstream Chinese and international LLMs are evaluated under a unified API environment. Models perform better on subjective than on objective questions: the highest accuracies on multiple-choice and true/false items are 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking attain the best overall scores of 72%–74%. Models perform best in production engineering and weakest in reservoir engineering; Chinese models do better on multiple-choice questions, while international models hold a slight advantage on short-answer questions. The benchmark is intended as a reproducible, practical reference for evaluating and deploying LLMs in petroleum engineering.
📝 Abstract
Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.
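The abstract reports accuracy separately for multiple-choice and true-or-false questions but does not describe the scoring procedure. As a minimal sketch only, assuming objective items are graded by case-insensitive exact match against a reference answer, per-format accuracy could be computed as below. The record layout, the function name `accuracy_by_format`, and the toy rows are all hypothetical and are not taken from PetroBench; subjective formats (term definition, short answer) would need a rubric or judge and are not covered here.

```python
from collections import defaultdict

# Hypothetical graded records: (question_format, model_answer, reference_answer).
# These rows are toy placeholders for illustration, not PetroBench data.
records = [
    ("multiple_choice", "B", "B"),
    ("multiple_choice", "C", "A"),
    ("true_false", "True", "True"),
    ("true_false", "False", "True"),
    ("true_false", "true", "True"),
]


def accuracy_by_format(rows):
    """Return {format: fraction of exact matches}, ignoring case and whitespace.

    Exact-match grading is an assumption; the source does not state its metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for fmt, predicted, reference in rows:
        total[fmt] += 1
        if predicted.strip().lower() == reference.strip().lower():
            correct[fmt] += 1
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


scores = accuracy_by_format(records)
```

Grouping by format before averaging keeps the two objective question types separate, which matches how the abstract reports them.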
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Petroleum Engineering
Benchmark
Domain-specific Evaluation
Model Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

PetroBench
domain-specific benchmark
large language models
petroleum engineering
multi-model evaluation
Xiang Wang
University of Science and Technology of China
Trustworthy AI, Graph Learning, Recommendation, Foundation Models, Multimodal Models
Tingting Zhang
School of Petroleum and Natural Gas Engineering, Changzhou University, Changzhou 213164, China
Sen Wang
China University of Petroleum (East China), Qingdao, Shandong 266580, China
Ying Wu
Professor of Electrical and Computer Engineering, Northwestern University
computer vision and pattern recognition
Heng Meng
School of Petroleum and Natural Gas Engineering, Changzhou University, Changzhou 213164, China
Peng Zhou
School of Petroleum and Natural Gas Engineering, Changzhou University, Changzhou 213164, China
Peng Li
School of Petroleum and Natural Gas Engineering, Changzhou University, Changzhou 213164, China