ABTest: Behavior-Driven Testing for AI Coding Agents

📅 2026-04-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches offer no systematic way to evaluate the robustness of AI coding agents under diverse and adversarial scenarios. This work proposes ABTest, a behavior-driven fuzz testing framework that automatically probes the robustness of AI coding agents by transforming real-world user-reported failures into repository-level behavioral tests. The approach distills 47 Interaction Patterns and 128 Action types from 400 developer-confirmed user reports, composes them into stepwise, repository-scale fuzzing templates, and generates 647 test cases. Evaluation across three leading AI coding agents uncovered 1,573 behavioral anomalies, of which 642 were manually confirmed as genuine new failures, for a detection precision of 40.8%.
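The reported precision follows directly from the two counts in the summary; a quick sanity check (plain Python, variable names ours):

```python
flagged = 1573     # behavioral anomalies flagged across the three agents
confirmed = 642    # manually confirmed as new true anomalies
precision = confirmed / flagged
print(f"detection precision: {precision:.1%}")  # detection precision: 40.8%
```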
📝 Abstract
AI coding agents are increasingly integrated into real-world software development workflows, yet their robustness under diverse and adversarial scenarios remains poorly understood. We present ABTest, a behavior-driven fuzzing framework that systematically tests coding agents by turning real-world failure reports into repository-grounded behavioral tests. ABTest (1) mines user-reported anomalies to derive reusable workflow patterns (Interaction Patterns) and behaviors (Action types); (2) composes them into stepwise fuzzing templates; (3) instantiates executable test cases in real repositories; (4) executes them with coding agents while recording traces and artifacts; and (5) detects and validates anomalous behaviors. We apply ABTest to three widely used coding agents: Claude Code, OpenAI Codex CLI, and Gemini CLI. From 400 user-reported developer-confirmed agent failures, we extract 47 Interaction Patterns and 128 Action types, generating 647 repository-grounded fuzzing cases. Executing the 647-case bundle once per evaluated configuration, ABTest flags 1,573 behavioral anomalies across the three coding agent families, of which 642 are manually confirmed as new true anomalies, achieving a detection precision of 40.8%. Our results demonstrate that ABTest effectively uncovers real-world failures, exposes robustness differences across models, and reveals previously unreported failure modes.
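The five stages listed in the abstract can be sketched end to end as a pipeline. Everything below (function names, dict shapes, the toy agent and oracle) is illustrative, not the authors' implementation:

```python
def mine_reports(reports):
    """Stage 1: derive Interaction Patterns and Action types from user reports."""
    patterns = {r["pattern"] for r in reports}
    actions = {a for r in reports for a in r["actions"]}
    return patterns, actions

def compose_templates(patterns, actions):
    """Stage 2: combine patterns with action types into stepwise fuzzing templates."""
    return [{"pattern": p, "steps": sorted(actions)} for p in sorted(patterns)]

def instantiate_cases(templates, repos):
    """Stage 3: ground each template in a real repository as an executable case."""
    return [{"template": t, "repo": r} for t in templates for r in repos]

def run_cases(agent, cases):
    """Stage 4: execute each case with a coding agent, recording its trace."""
    return [{"case": c, "trace": agent(c)} for c in cases]

def detect_anomalies(runs, oracle):
    """Stage 5: flag runs whose recorded trace violates expected behavior."""
    return [r for r in runs if not oracle(r["trace"])]

# Toy walk-through: one mined pattern, two repositories, a fake agent.
reports = [{"pattern": "edit-then-test", "actions": ["edit_file", "run_tests"]}]
patterns, actions = mine_reports(reports)
cases = instantiate_cases(compose_templates(patterns, actions), ["repo_a", "repo_b"])
runs = run_cases(lambda c: "ok" if c["repo"] == "repo_a" else "crash", cases)
print(len(detect_anomalies(runs, lambda trace: trace == "ok")))  # 1
```

The real system replaces the toy agent with Claude Code, Codex CLI, or Gemini CLI, and the one-line oracle with trace- and artifact-based anomaly checks plus manual validation.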
Problem

Research questions and friction points this paper is trying to address.

AI coding agents
robustness
behavioral testing
failure modes
fuzzing
Innovation

Methods, ideas, or system contributions that make the work stand out.

behavior-driven testing
fuzzing framework
AI coding agents
interaction patterns
anomaly detection