🤖 AI Summary
This study systematically evaluates performance disparities between ChatGPT and DeepSeek-series large language models (LLMs) on scientific computing and scientific machine learning tasks, particularly partial differential equation (PDE) solving and neural operator learning, that demand sophisticated modeling decisions. Method: We conduct the first LLM benchmarking on PDE modeling and neural operator input-space design, introducing a unified evaluation framework grounded in prompt engineering for numerical method formulation, neural operator architecture generation, and joint assessment of PDE discretization fidelity and generalization capability. Contribution/Results: Contrary to expectations, reasoning-optimized models (e.g., DeepSeek-R1) do not consistently outperform state-of-the-art general-purpose models. ChatGPT o3-mini-high demonstrates superior accuracy, response latency, and task adaptability across most evaluated benchmarks, establishing itself as the most practical and efficient LLM for scientific computing applications to date.
📝 Abstract
Large Language Models (LLMs) have emerged as powerful tools for tackling a wide range of problems, including those in scientific computing, particularly in solving partial differential equations (PDEs). However, different models exhibit distinct strengths and preferences, resulting in varying levels of performance. In this paper, we compare the capabilities of the most advanced LLMs, ChatGPT and DeepSeek, along with their reasoning-optimized versions, in addressing computational challenges. Specifically, we evaluate their proficiency in solving traditional numerical problems in scientific computing as well as leveraging scientific machine learning techniques for PDE-based problems. We designed all our experiments so that a non-trivial decision is required, e.g., defining the proper space of input functions for neural operator learning. Our findings reveal that the latest model, ChatGPT o3-mini-high, usually delivers the most accurate results while also responding significantly faster than its reasoning counterpart, DeepSeek-R1. This enhanced speed and accuracy make ChatGPT o3-mini-high a more practical and efficient choice for diverse computational tasks at this juncture.
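To make the abstract's example of a non-trivial modeling decision concrete: one common way to define a space of input functions for neural operator learning is to sample them from a Gaussian random field whose length scale controls smoothness. The sketch below is illustrative only, not taken from the paper; the function name, the sine basis (which imposes zero boundary values), and the squared-exponential spectral decay are all assumptions of this example.

```python
import numpy as np

def sample_grf_functions(n_samples, n_points=128, length_scale=0.2,
                         n_modes=64, seed=0):
    """Sample 1-D input functions on [0, 1] from a Gaussian random field.

    Uses a truncated Karhunen-Loeve-style sine expansion with a
    squared-exponential spectral decay; larger length_scale damps
    high-frequency modes more strongly, yielding smoother functions.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    k = np.arange(1, n_modes + 1)
    # Spectral weights: high frequencies decay faster for larger length scales.
    lam = np.exp(-0.5 * (np.pi * k * length_scale) ** 2)
    xi = rng.standard_normal((n_samples, n_modes))   # i.i.d. Gaussian coefficients
    basis = np.sqrt(2.0) * np.sin(np.pi * np.outer(k, x))  # (n_modes, n_points)
    return x, (xi * np.sqrt(lam)) @ basis            # (n_samples, n_points)

# Draw a small batch of rough (short length scale) input functions.
x, fs = sample_grf_functions(n_samples=16, length_scale=0.1)
```

Varying `length_scale` (or swapping the kernel entirely) changes the function space the operator is trained on, which is exactly the kind of design choice the benchmarked models had to make.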