🤖 AI Summary
This study addresses the lack of domain-specific evaluation benchmarks for large language models (LLMs) in petroleum engineering by introducing a standardized assessment framework that covers production, reservoir, and drilling engineering and comprises 1,200 questions in multiple-choice, true/false, term-definition, and short-answer formats. The question bank is built through a three-stage pipeline of data preprocessing, quality filtering, and multi-model validation, combined with expert review. Eight mainstream Chinese and international LLMs are evaluated under a unified API environment. Results show that models perform better on subjective than on objective questions, with peak accuracies of 65.3% on multiple-choice and 74.3% on true/false items. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking attain the best overall scores of 72%–74%; Chinese models do better on multiple-choice questions, while international models hold a slight edge on short-answer questions. The benchmark offers strong domain relevance, discriminative power, and reproducibility.
📝 Abstract
Large language models (LLMs) are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, built through a three-stage process of data preprocessing, quality filtering, and multi-model validation. With the help of expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than on objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.
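The abstract reports accuracy separately for multiple-choice and true or false questions but does not say how answers are matched or aggregated. As a minimal sketch, assuming exact-match scoring after case and whitespace normalization, per-format accuracy could be computed as follows. The function name `accuracy_by_format`, the record keys `format`, `gold`, and `pred`, and the toy records are all hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_format(records):
    """Return exact-match accuracy per question format.

    Each record is a dict with the hypothetical keys
    'format', 'gold' (reference answer), and 'pred' (model answer).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["format"]] += 1
        # Assumption: answers match if equal after trimming and upper-casing.
        if r["pred"].strip().upper() == r["gold"].strip().upper():
            correct[r["format"]] += 1
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


# Toy records invented for illustration only (not benchmark data).
records = [
    {"format": "multiple_choice", "gold": "B", "pred": "b"},
    {"format": "multiple_choice", "gold": "C", "pred": "A"},
    {"format": "true_false", "gold": "True", "pred": " true "},
    {"format": "true_false", "gold": "False", "pred": "False"},
]
scores = accuracy_by_format(records)
print(scores)
```

Subjective formats such as term definition and short answer would need a different scoring scheme (for example rubric-based or expert grading), which this sketch does not attempt to cover.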