🤖 AI Summary
This work addresses the limitations of current large language models (LLMs) in generating mechanistic models under partial observability and diverse task objectives, where outputs often suffer from unreliability, coding errors, and limited validity. To overcome these challenges, we propose the Neural-Integrated Mechanistic Modeling (NIMM) evaluation framework and develop the NIMMgen agent system, which leverages iterative refinement between neural networks and mechanistic models to substantially improve code correctness, empirical validity, and interpretability. Our approach provides the first systematic assessment of LLM-generated mechanistic models under realistic conditions, enabling reliable digital-twin construction and counterfactual intervention simulation. Experiments across three scientific datasets demonstrate that models generated by NIMMgen exhibit strong counterfactual reasoning capabilities in complex scenarios.
📝 Abstract
Mechanistic models encode scientific knowledge about dynamical systems and are widely used in downstream scientific and policy applications. Recent work has explored LLM-based agentic frameworks that automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated mechanistic models are reliable in practice. To address this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) evaluation framework, which evaluates LLM-generated mechanistic models under realistic settings with partial observations and diverse task objectives. Our evaluation reveals fundamental challenges in current baselines, ranging from model effectiveness to code-level correctness. Motivated by these findings, we design NIMMgen, an agentic framework for neural-integrated mechanistic modeling that enhances code correctness and practical validity through iterative refinement. Experiments across three datasets from diverse scientific domains demonstrate its strong performance. We also show that the learned mechanistic models support counterfactual intervention simulation.