HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the heavy reliance on human experts in calibrating distributed hydrological models and the resulting difficulty in transferring calibration expertise across basins, which hinders efficient water resource management. To overcome this limitation, the authors propose HydroAgent, a framework that integrates supervised fine-tuning with reinforcement learning driven by feedback from an online hydrological simulator. Using the Nash–Sutcliffe efficiency coefficient as the reward signal, the approach optimizes a Qwen3-4B model via Group-Relative Policy Optimization. After fine-tuning on 2,576 expert calibration trajectories, HydroAgent significantly narrows the performance gap with human experts across four independent catchments, demonstrating the advantages of domain-specialized small language models in achieving both physical consistency and calibration efficiency.
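For readers unfamiliar with the reward signal named above, the following is a minimal sketch of the standard Nash–Sutcliffe efficiency computed from an observed and a simulated streamflow series. The function name `nse` and the plain-list inputs are illustrative assumptions; the paper's own reward code and any preprocessing it applies to CREST output are not described in this summary.

```python
from statistics import fmean


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations.

    1.0 is a perfect match, 0.0 is no better than predicting the
    observed mean, and negative values are worse than the mean.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    if not observed or len(observed) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")

    mean_obs = fmean(observed)
    # Sum of squared errors between simulation and observation.
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    # Sum of squared deviations of the observations from their mean.
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - sse / spread


if __name__ == "__main__":
    obs = [1.0, 2.0, 3.0]
    print(nse(obs, [1.0, 2.0, 3.0]))  # perfect simulation
    print(nse(obs, [2.0, 2.0, 2.0]))  # simulation equal to the observed mean
```

Because the score is unbounded below, a poor parameter set can produce a strongly negative reward, which is consistent with the negative scores reported for some agents in the abstract.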

📝 Abstract

Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management - streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents - Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash - on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329 to 40,792 km² ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus 4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HYDROAGENT, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations - reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data - remote sensing, in-situ time series, and forecaster narrative - makes domain agents a leveraged direction for AI in physical science.
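To illustrate how a simulator score can drive Group-Relative Policy Optimization, the sketch below normalizes the rewards of one group of sampled calibration proposals into advantages, which is the group-relative step in the commonly published form of GRPO. It is an assumption that the paper uses this standard mean and standard-deviation normalization; the function name `group_relative_advantages`, the `eps` constant, and the example reward values are all hypothetical.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-sample rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps), so
    proposals that beat their siblings get positive advantages and no
    separate value network is needed. (Illustrative sketch only; the
    paper's exact GRPO variant is not specified in the abstract.)
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean_r = fmean(rewards)
    std_r = pstdev(rewards)  # population std over the group
    return [(r - mean_r) / (std_r + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical NSE scores returned by the simulator for four
    # parameter proposals sampled from the same prompt.
    scores = [-0.10, 0.40, 0.55, 0.70]
    for score, adv in zip(scores, group_relative_advantages(scores)):
        print(f"NSE={score:+.2f}  advantage={adv:+.3f}")
```

One consequence of this design is that a group whose proposals all score the same yields zero advantages and therefore no gradient signal, so reward diversity within a group matters as much as the absolute score.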

Problem

Research questions and friction points this paper is trying to address.

hydrologic model calibration

large language models

domain grounding

reinforcement learning

streamflow prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

simulator-grounded reinforcement learning

hydrologic model calibration

domain-tuned LLM agent