SciML Agents: Write the Solver, Not the Solution

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can serve as trustworthy scientific machine learning (SciML) agents capable of generating **numerically sound and executable code** for solving ordinary differential equations (ODEs) described in natural language, while autonomously diagnosing stiffness, selecting appropriate solvers, and verifying numerical stability. To this end, the authors introduce the first benchmark suite specifically designed to evaluate SciML agent capabilities—covering misleading diagnostic tasks and large-scale ODE problems. They propose domain-informed, guided prompting and fine-tuning strategies, augmented with an integrated numerical stability verification module. Experimental results demonstrate that state-of-the-art instruction-tuned models achieve high code executability and numerical accuracy under sufficient context; notably, several open-source LLMs attain strong performance without fine-tuning. These findings substantiate the feasibility of deploying LLMs as reliable agents for scientific computing.
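The pipeline the summary describes — diagnose stiffness, pick a matching solver, verify stability — can be sketched in SciPy. This is an illustrative sketch of the kind of code such an agent might emit, not the paper's actual agent output; the Van der Pol test problem, the eigenvalue-ratio heuristic, and its threshold are assumptions for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Van der Pol oscillator; stiff for large MU (illustrative test problem,
# not one of the paper's benchmark tasks).
MU = 1000.0

def f(t, y):
    return [y[1], MU * (1 - y[0] ** 2) * y[1] - y[0]]

def jac(t, y):
    return np.array([[0.0, 1.0],
                     [-2.0 * MU * y[0] * y[1] - 1.0, MU * (1 - y[0] ** 2)]])

def looks_stiff(y0, t0=0.0, threshold=1e3):
    """Crude stiffness heuristic (assumed, not the paper's method):
    a large spread in Jacobian eigenvalue magnitudes at the initial
    state suggests widely separated time scales, i.e. stiffness."""
    mags = np.abs(np.linalg.eigvals(jac(t0, np.asarray(y0, dtype=float))))
    mags = mags[mags > 0]
    return len(mags) > 1 and mags.max() / mags.min() > threshold

y0 = [2.0, 0.0]
# Implicit solver (Radau) for stiff problems, explicit RK45 otherwise.
method = "Radau" if looks_stiff(y0) else "RK45"
sol = solve_ivp(f, (0.0, 10.0), y0, method=method, jac=jac,
                rtol=1e-6, atol=1e-9)

# Basic numerical-stability verification: solver succeeded, no blow-up.
assert sol.success, sol.message
assert np.all(np.isfinite(sol.y))
```

Choosing an implicit method here matters: an explicit solver forced through a stiff regime must take step sizes bounded by the fastest time scale, which is exactly the failure mode a solver-selecting agent is meant to avoid.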

📝 Abstract
Recent work in scientific machine learning aims to tackle scientific tasks directly by predicting target values with neural networks (e.g., physics-informed neural networks, neural ODEs, neural operators, etc.), but attaining high accuracy and robustness has been challenging. We explore an alternative view: use LLMs to write code that leverages decades of numerical algorithms. This shifts the burden from learning a solution function to making domain-aware numerical choices. We ask whether LLMs can act as SciML agents that, given a natural-language ODE description, generate runnable code that is scientifically appropriate: selecting suitable solvers (stiff vs. non-stiff) and enforcing stability checks. There is currently no benchmark to measure this kind of capability for scientific computing tasks. As such, we first introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks. The diagnostic set contains problems whose superficial appearance suggests stiffness, and that require algebraic simplification to demonstrate non-stiffness; the large-scale benchmark spans stiff and non-stiff ODE regimes. We evaluate open- and closed-source LLMs along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Our evaluation measures both executability and numerical validity against reference solutions. We find that with sufficient context and guided prompts, newer instruction-following models achieve high accuracy on both criteria. In many cases, recent open-source systems perform strongly without fine-tuning, while older or smaller models still benefit from fine-tuning. Overall, our preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate scientifically valid ODE solver code
Creating benchmarks for scientific computing tasks with LLMs
Assessing code executability and numerical accuracy against reference solutions
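The third bullet — scoring numerical accuracy against reference solutions — can be sketched as follows. The test ODE, the tight-tolerance reference construction, and the pass threshold are assumptions for illustration; the paper's actual grading procedure and thresholds are not reproduced here.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Simple non-stiff test ODE: y' = -y, exact solution exp(-t)
# (illustrative; not a task from the paper's 1,000-problem benchmark).
def f(t, y):
    return -y

t_eval = np.linspace(0.0, 5.0, 51)

# Candidate solution, as an agent-generated script might produce it.
candidate = solve_ivp(f, (0.0, 5.0), [1.0], method="RK45",
                      t_eval=t_eval, rtol=1e-6, atol=1e-9)

# High-accuracy reference computed with much tighter tolerances.
reference = solve_ivp(f, (0.0, 5.0), [1.0], method="Radau",
                      t_eval=t_eval, rtol=1e-10, atol=1e-12)

# Relative L2 error against the reference; 1e-4 is an assumed pass bar.
rel_err = (np.linalg.norm(candidate.y - reference.y)
           / np.linalg.norm(reference.y))
numerically_valid = candidate.success and rel_err < 1e-4
```

Separating the two criteria matters: a script can execute cleanly (the first axis) while producing numerically invalid output (the second), so both checks are applied independently.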
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate code for numerical algorithms
LLMs select suitable solvers with stability checks
Guided prompting and fine-tuning enhance accuracy
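A guided prompt of the kind the bullets describe injects domain knowledge (stiffness diagnosis, solver choice, tolerance and stability checks) before the model writes any code. The template below is a hypothetical sketch of such a prompt; its wording and the example ODE are assumptions, not the paper's actual prompt text.

```python
# Hypothetical guided-prompt template (assumed wording): domain-specific
# hints steer the LLM toward numerically sound choices.
GUIDED_PROMPT = """You are a scientific computing assistant.
Problem: {ode_description}

Before writing code:
1. Decide whether the ODE is stiff (inspect time-scale separation or
   Jacobian eigenvalues; simplify algebraically first if possible).
2. Pick a matching SciPy solver: 'Radau' or 'BDF' if stiff, 'RK45'
   otherwise.
3. Set explicit rtol/atol, and check that the returned solution is
   finite and that the solver reports success.

Return a single runnable Python script using scipy.integrate.solve_ivp.
"""

prompt = GUIDED_PROMPT.format(
    ode_description="Solve y' = -1000*(y - cos(t)) on [0, 1] with y(0) = 0."
)
```

An unguided baseline would omit steps 1-3 and state only the problem, which is the contrast the paper's prompting axis evaluates.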