🤖 AI Summary
This study investigates whether AI agents produce consistent empirical results when given identical data and research questions, and quantifies the "non-standard errors" that arise from their divergent analytical choices. The authors deploy 150 autonomous coding agents, based on Claude Sonnet 4.6 and Opus 4.6, to independently test six hypotheses about SPY market quality using NYSE TAQ data, and introduce a three-stage feedback protocol to assess how AI peer review and exemplar learning affect result consistency. The findings show that the two model families exhibit distinct and stable "empirical styles." Written peer critiques have little effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates within converging measure families by 80–99%. That convergence, however, stems primarily from imitation rather than genuine understanding: some agents simply switch measure families to align with the exemplars' outcomes.
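The paper's headline dispersion statistic is easy to make concrete. Below is a minimal Python sketch, using hypothetical point estimates in place of the agents' actual outputs, of how a non-standard error can be sized as the interquartile range (IQR) of within-family estimates, and how the reported 80–99% reduction would be computed:

```python
import numpy as np

def iqr(estimates: np.ndarray) -> float:
    """Interquartile range of point estimates: the dispersion behind a non-standard error."""
    q75, q25 = np.percentile(estimates, [75, 25])
    return q75 - q25

# Hypothetical stand-ins for agent point estimates (e.g., an annual trend
# coefficient for quoted spreads) within a single measure family.
rng = np.random.default_rng(0)
stage1 = rng.normal(loc=-0.5, scale=1.0, size=50)    # before feedback: wide dispersion
stage3 = rng.normal(loc=-0.5, scale=0.05, size=50)   # after exemplar exposure: tight

reduction = 1.0 - iqr(stage3) / iqr(stage1)
print(f"IQR stage 1: {iqr(stage1):.3f}")
print(f"IQR stage 3: {iqr(stage3):.3f}")
print(f"reduction:   {reduction:.1%}")   # ~95% here; the paper reports 80-99%
```

The 50-agent samples, the Gaussian draws, and the stage labels are illustrative assumptions; only the IQR-reduction calculation mirrors the paper's headline metric.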
📝 Abstract
We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015–2024), we find that AI agents exhibit sizable *nonstandard errors* (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to the NSEs documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable "empirical styles," reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80–99% within *converging* measure families. Convergence occurs both through within-family tightening of estimates and through agents switching measure families entirely, but it reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.
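To see why measure choice alone can drive non-standard errors, consider the autocorrelation-vs-variance-ratio example from the abstract: both are standard price-efficiency proxies, but they are different statistics and need not agree. The sketch below uses simulated returns and textbook formulas, not the paper's TAQ pipeline:

```python
import numpy as np

def lag1_autocorr(returns: np.ndarray) -> float:
    """First-order return autocorrelation (one common efficiency measure)."""
    r = returns - returns.mean()
    return float((r[:-1] * r[1:]).sum() / (r * r).sum())

def variance_ratio(returns: np.ndarray, q: int = 5) -> float:
    """Variance of overlapping q-period returns over q times the 1-period
    variance; ~1 for a random walk (Lo-MacKinlay style, no small-sample corrections)."""
    rq = np.convolve(returns, np.ones(q), mode="valid")  # overlapping q-period sums
    return float(rq.var() / (q * returns.var()))

# Simulated daily log returns standing in for SPY; both statistics target
# "price efficiency," but an agent reporting one is not directly comparable
# to an agent reporting the other: a family-level source of non-standard errors.
rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.01, size=2500)

print(f"lag-1 autocorrelation: {lag1_autocorr(returns):+.4f}")
print(f"variance ratio (q=5):  {variance_ratio(returns):.4f}")
```

For a random walk the variance ratio sits near 1 and the autocorrelation near 0; on real data the two can point in different directions, which is the kind of family-level divergence the paper documents.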