🤖 AI Summary
Rigorous evaluation of large language models (LLMs) for automating computational fluid dynamics (CFD) numerical experiments has been lacking, hindering their adoption in scientific computing. Method: We propose the first comprehensive CFD-specific evaluation framework, featuring a three-dimensional assessment taxonomy—covering CFD knowledge comprehension, physics-informed numerical reasoning, and context-aware code generation—grounded in realistic application scenarios. It employs multi-dimensional quantitative metrics with strict validation of code executability, numerical solution accuracy, and convergence behavior. Contribution/Results: We open-source a benchmark dataset and evaluation toolkit comprising three core CFD tasks. Empirical evaluation reveals pervasive limitations in state-of-the-art LLMs, including weak physical consistency and poor numerical robustness. Our framework establishes a reproducible, extensible evaluation paradigm for deploying LLMs in complex physical system modeling and simulation.
📝 Abstract
Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments for complex physical systems -- a critical and labor-intensive component of computational science -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning in CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
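To make the three evaluation dimensions concrete, here is a minimal hypothetical sketch of how executability, solution accuracy, and convergence checks might be aggregated into per-dimension scores. All names (`CFDEvalResult`, `score`, the tolerance default) are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a three-dimension CFD evaluation record;
# names and thresholds are illustrative, not CFDLLMBench's real interface.
from dataclasses import dataclass


@dataclass
class CFDEvalResult:
    executable: bool     # did the generated solver code run without error?
    rel_l2_error: float  # relative L2 error against a reference solution
    converged: bool      # did residuals fall below a set tolerance?


def score(result: CFDEvalResult, err_tol: float = 1e-2) -> dict:
    """Aggregate the three checks into per-dimension 0/1 scores.

    Accuracy and convergence are conditioned on executability: code that
    fails to run cannot earn credit on the other two dimensions.
    """
    return {
        "executability": float(result.executable),
        "accuracy": float(result.executable and result.rel_l2_error <= err_tol),
        "convergence": float(result.executable and result.converged),
    }


# Example: code ran and met the error tolerance, but residuals stalled.
print(score(CFDEvalResult(executable=True, rel_l2_error=5e-3, converged=False)))
```

Scores like these can then be averaged over a task suite to produce the kind of per-dimension leaderboard numbers the abstract describes.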