🤖 AI Summary
This work addresses the critical risk of hallucinated outputs from large language models (LLMs) in environmental data management, particularly when irreversible operations, such as DOI assignment, are involved. To mitigate this, the authors propose EnviSmart, a system that treats reliability as an architectural property rather than an afterthought. EnviSmart employs a tripartite knowledge framework that separates externalized behaviors, domain knowledge, and skills, together with a role-isolated multi-agent architecture. At key trust boundaries, it integrates deterministic verification and auditable handoff mechanisms to enable error isolation and rapid response. In the SF2Bench deployment, the system processed data from 2,452 sites within two days and intercepted a coordinate transformation error affecting all sites before publication. In a representative incident, the fault was detected within 10 minutes with zero user exposure and fully remediated within 80 minutes.
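The tripartite knowledge framework above can be pictured as three independently persisted tracks that agents consult rather than improvise. The sketch below is purely illustrative: the class and field names are assumptions for exposition, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three-track separation described in the
# summary. All names here are hypothetical, not from EnviSmart itself.
@dataclass(frozen=True)
class KnowledgeBase:
    """Three separated tracks, persisted as distinct artifacts so each
    can be versioned, audited, and reused across deployments."""
    behaviors: dict = field(default_factory=dict)  # governance constraints
    domain: dict = field(default_factory=dict)     # retrievable context
    skills: dict = field(default_factory=dict)     # tool-using procedures

    def constraint(self, name: str):
        # Governance rules are looked up deterministically at run time,
        # never inferred by the language model.
        return self.behaviors[name]

kb = KnowledgeBase(
    behaviors={"require_validation_before_doi": True},
    domain={"target_crs": "EPSG:4326"},
    skills={"transform_coords": "scripts/transform.py"},
)
```

Keeping the tracks separate means a governance change (behaviors) does not require retraining or re-prompting the skills track, which is one plausible reading of why the artifacts are described as "interlocking" yet reusable.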
📝 Abstract
Embedding LLM-driven agents into environmental FAIR data management is compelling: they can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions. However, replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release. We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research. EnviSmart treats reliability as an architectural property through two mechanisms: a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and a role-separated multi-agent design in which deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps. We compare two production deployments. The University's GIS Center Ecological Archive (849 curated datasets) serves as a single-agent baseline; SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow. The multi-agent approach improved both efficiency (a single operator completed the deployment in two days, with repeated artifact reuse across deployments) and reliability (audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication). A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution. This paper has been accepted at PEARC 2026.
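The fail-stop behavior at trust boundaries can be sketched as a deterministic gate that runs before any irreversible action. The following is a minimal sketch under stated assumptions: the record type, function names, and the coordinate-range check are hypothetical stand-ins, not the paper's actual validators.

```python
from dataclasses import dataclass

# Hypothetical record type for illustration; field names are assumptions.
@dataclass
class StationRecord:
    station_id: str
    lat: float
    lon: float

class ValidationError(Exception):
    """Raised at the trust boundary; halts the pipeline (fail-stop)."""

def validate_coordinates(records):
    """Deterministic gate: every record must carry plausible WGS84
    coordinates before any irreversible step (e.g., DOI minting)."""
    bad = [r.station_id for r in records
           if not (-90.0 <= r.lat <= 90.0 and -180.0 <= r.lon <= 180.0)]
    if bad:
        # Fail-stop: block the handoff with an auditable error instead of
        # letting a plausible-but-wrong output propagate downstream.
        raise ValidationError(
            f"{len(bad)} record(s) failed coordinate check: {bad[:5]}")
    return records

def publish_with_gate(records, mint_doi):
    """Audited handoff: the irreversible action runs only on output
    that has passed the deterministic validator."""
    validated = validate_coordinates(records)
    return [mint_doi(r) for r in validated]
```

A lat/lon swap produced by a faulty coordinate transformation, for example, would put latitude outside [-90, 90] and trip the gate before publication, which mirrors the kind of error the abstract reports being blocked across all 2,452 stations.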