Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses "silent failures" in scientific AI agents on astrophysical tasks: confidently stated outputs that look plausible but are physically incorrect. It introduces the first systematic evaluation framework for assessing the reliability of scientific AI agents and applies it to CMBAgent across two workflow paradigms (One-Shot and Deep Research) and 18 astrophysics tasks. The methodology combines domain-specific context injection, stress testing, and post-hoc consistency checks. Supplying domain context raises accuracy from near zero to 0.85, which the authors report as an approximately sixfold improvement, yet silent failures remain frequent in boundary-case reasoning, where the agent violates physical consistency without issuing any warning. The study identifies silent failure as the most hazardous failure mode in current scientific AI systems.

📝 Abstract

Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning failure mode in agentic scientific workflows is not overt failure, but confident generation of incorrect results. We release our evaluation framework to facilitate systematic reliability analysis of scientific AI agents.
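
To make the idea of a silent failure concrete, here is a minimal sketch of what a post-hoc consistency check on posterior samples might look like. It is not the paper's evaluation framework or any part of CMBAgent: the function name, the parameter names, and the physical bounds are all illustrative assumptions. The point is only that an out-of-range posterior produces an explicit warning instead of passing unnoticed.

```python
# Hypothetical post-hoc check. The bounds and parameter names below are
# illustrative assumptions, not values taken from the paper or from CMBAgent.
PHYSICAL_BOUNDS = {
    "omega_m": (0.0, 1.0),  # matter density fraction (assumed allowed range)
    "h": (0.4, 1.0),        # dimensionless Hubble parameter (loose assumed range)
    "sigma_8": (0.0, 2.0),  # fluctuation amplitude (assumed allowed range)
}


def check_posterior(samples, bounds=PHYSICAL_BOUNDS, max_bad_fraction=0.0):
    """Return a list of warnings; an empty list means no violation was found.

    `samples` maps a parameter name to a list of posterior draws.
    """
    warnings = []
    for name, (lo, hi) in bounds.items():
        draws = samples.get(name)
        if not draws:
            # A missing or empty chain is itself a failure worth reporting.
            warnings.append(f"{name}: no samples returned")
            continue
        # NaN fails both comparisons, so it is counted as out of range.
        bad = sum(1 for x in draws if not (lo <= x <= hi))
        if bad / len(draws) > max_bad_fraction:
            warnings.append(f"{name}: {bad}/{len(draws)} draws outside [{lo}, {hi}]")
    return warnings
```

A check of this kind only catches violations that can be written down as explicit bounds; a result that is wrong but lies within the allowed range would still pass it.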

Problem

Research questions and friction points this paper is trying to address.

Agentic AI

silent failure

scientific workflows

plausible but wrong

reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic AI

silent failure

scientific workflows