PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of domain-specific evaluation benchmarks for large language models (LLMs) in petroleum engineering by introducing a standardized benchmark that covers production, reservoir, and drilling engineering and comprises 1,200 questions in four formats: multiple-choice, true/false, term definition, and short-answer. The question bank is built through a three-stage process of data preprocessing, quality filtering, and multi-model validation, supported by expert review. Eight mainstream Chinese and international LLMs are evaluated under a unified API environment. Models perform better on subjective questions than on objective ones; on the objective items, the highest accuracies are 65.3% for multiple-choice and 74.3% for true/false. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking attain the best overall scores of 72%–74%. Models score highest in production engineering and lowest in reservoir engineering, with Chinese models ahead on multiple-choice questions and international models slightly ahead on short-answer questions. The authors present the benchmark as domain-relevant, discriminative, and reproducible.

📝 Abstract

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.
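
To make the objective-question metrics concrete, the following is a minimal sketch of how per-format accuracy (such as the multiple-choice and true or false figures above) could be computed. It is not the authors' scoring code: the record fields `format`, `answer`, and `prediction`, the function name `accuracy_by_format`, and the toy records are all assumptions made for illustration, and the abstract does not describe how subjective answers were graded.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return exact-match accuracy per question format.

    Each record is a dict with hypothetical keys:
    'format' (e.g. 'multiple_choice'), 'answer' (gold label),
    and 'prediction' (model output already reduced to a label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        fmt = record["format"]
        total[fmt] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        gold = record["answer"].strip().upper()
        pred = record["prediction"].strip().upper()
        if gold == pred:
            correct[fmt] += 1
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


# Toy records for illustration only; these are not benchmark data.
sample = [
    {"format": "multiple_choice", "answer": "B", "prediction": "b"},
    {"format": "multiple_choice", "answer": "C", "prediction": "A"},
    {"format": "true_false", "answer": "True", "prediction": "true "},
    {"format": "true_false", "answer": "False", "prediction": "False"},
]
scores = accuracy_by_format(sample)
print(scores)
```

Exact matching of this kind applies only to the objective formats; term-definition and short-answer responses would need a separate rubric or grader, which this sketch does not attempt.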

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Petroleum Engineering

Benchmark

Domain-specific Evaluation

Model Performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

PetroBench

domain-specific benchmark

large language models