🤖 AI Summary
Rigorous evaluation of large language models (LLMs) for automating computational fluid dynamics (CFD) numerical experiments has been lacking, hindering their adoption in scientific computing. Method: We propose the first comprehensive CFD-specific evaluation framework, featuring a three-dimensional assessment taxonomy—covering CFD knowledge comprehension, physics-informed numerical reasoning, and context-aware code generation—grounded in realistic application scenarios. It employs multi-dimensional quantitative metrics with strict validation of code executability, numerical solution accuracy, and convergence behavior. Contribution/Results: We open-source a benchmark dataset and evaluation toolkit comprising three core CFD tasks. Empirical evaluation reveals pervasive limitations in state-of-the-art LLMs, including weak physical consistency and poor numerical robustness. Our framework establishes a reproducible, extensible evaluation paradigm for deploying LLMs in complex physical system modeling and simulation.
📝 Abstract
Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments for complex physical systems -- a critical and labor-intensive component of computational science -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning in CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
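To make the three evaluation dimensions concrete, here is a minimal hypothetical sketch of how executability, solution accuracy, and convergence checks might be aggregated into per-dimension scores. All names (`CFDEvalResult`, `score`, the tolerance default) are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a three-dimension CFD evaluation record;
# names and thresholds are illustrative, not CFDLLMBench's real interface.
from dataclasses import dataclass


@dataclass
class CFDEvalResult:
    executable: bool     # did the generated solver code run without error?
    rel_l2_error: float  # relative L2 error against a reference solution
    converged: bool      # did residuals fall below a set tolerance?


def score(result: CFDEvalResult, err_tol: float = 1e-2) -> dict:
    """Aggregate the three checks into per-dimension 0/1 scores.

    Accuracy and convergence are conditioned on executability: code that
    fails to run cannot earn credit on the other two dimensions.
    """
    return {
        "executability": float(result.executable),
        "accuracy": float(result.executable and result.rel_l2_error <= err_tol),
        "convergence": float(result.executable and result.converged),
    }


# Example: code ran and met the error tolerance, but residuals stalled.
print(score(CFDEvalResult(executable=True, rel_l2_error=5e-3, converged=False)))
```

Scores like these can then be averaged over a task suite to produce the kind of per-dimension leaderboard numbers the abstract describes.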