AI Summary
Existing AI programming agents lack systematic methods for evaluating robustness in diverse and adversarial scenarios. This work proposes ABTest, the first behavior-driven fuzz testing framework that automatically validates the robustness of AI coding agents by transforming real-world user-reported failures into repository-level behavioral tests. The approach distills 47 interaction patterns and 128 action types from 400 user reports to construct stepwise, repository-scale fuzzing templates, generating 647 test cases. Evaluation across three leading AI coding agents uncovered 1,573 behavioral anomalies, including 642 newly confirmed genuine failures, achieving a detection precision of 40.8%.
Abstract
AI coding agents are increasingly integrated into real-world software development workflows, yet their robustness under diverse and adversarial scenarios remains poorly understood. We present ABTest, a behavior-driven fuzzing framework that systematically tests coding agents by turning real-world failure reports into repository-grounded behavioral tests. ABTest (1) mines user-reported anomalies to derive reusable workflow patterns (Interaction Patterns) and behaviors (Action types); (2) composes them into stepwise fuzzing templates; (3) instantiates executable test cases in real repositories; (4) executes them with coding agents while recording traces and artifacts; and (5) detects and validates anomalous behaviors.
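The five-step pipeline above can be sketched as a small program. All names and data shapes below (Report, mine_patterns, the stub agent and oracle) are illustrative assumptions for exposition, not the authors' actual ABTest API:

```python
# Hypothetical sketch of the ABTest pipeline; names are illustrative,
# not the authors' implementation.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Report:
    report_id: int
    interaction_pattern: str   # recurring workflow shape, e.g. "edit-then-run"
    action_type: str           # concrete agent behavior, e.g. "apply_patch"

def mine_patterns(reports):
    """Step 1: derive reusable Interaction Patterns and Action types."""
    return ({r.interaction_pattern for r in reports},
            {r.action_type for r in reports})

def compose_templates(patterns, actions):
    """Step 2: combine patterns and actions into stepwise fuzzing templates."""
    return list(product(sorted(patterns), sorted(actions)))

def instantiate_cases(templates, repos):
    """Step 3: ground each template in a real repository as a test case."""
    return [{"repo": repo, "pattern": p, "action": a}
            for repo in repos for p, a in templates]

def execute(case, agent):
    """Step 4: run the agent on the case, recording its trace (stubbed here)."""
    return {"case": case, "trace": agent(case)}

def detect_anomalies(results, oracle):
    """Step 5: flag executions whose traces violate the behavioral oracle."""
    return [r for r in results if not oracle(r["trace"])]

# Tiny end-to-end demo with a stub agent and oracle.
reports = [Report(1, "edit-then-run", "apply_patch"),
           Report(2, "edit-then-run", "delete_file"),
           Report(3, "ask-then-refactor", "apply_patch")]
patterns, actions = mine_patterns(reports)
cases = instantiate_cases(compose_templates(patterns, actions), ["repo_a"])
stub_agent = lambda case: "ok" if case["action"] == "apply_patch" else "crash"
results = [execute(c, stub_agent) for c in cases]
flagged = detect_anomalies(results, oracle=lambda trace: trace == "ok")
print(len(cases), len(flagged))   # 4 test cases, 2 flagged anomalies
```

At ABTest's actual scale, the same structure yields 47 patterns x 128 action types filtered and composed into 647 repository-grounded cases, with real agent invocations and trace-based anomaly detection in place of the stubs.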
We apply ABTest to three widely used coding agents: Claude Code, OpenAI Codex CLI, and Gemini CLI. From 400 user-reported developer-confirmed agent failures, we extract 47 Interaction Patterns and 128 Action types, generating 647 repository-grounded fuzzing cases. Executing the 647-case bundle once per evaluated configuration, ABTest flags 1,573 behavioral anomalies across the three coding agent families, of which 642 are manually confirmed as new true anomalies, achieving a detection precision of 40.8%. Our results demonstrate that ABTest effectively uncovers real-world failures, exposes robustness differences across models, and reveals previously unreported failure modes.
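As a quick sanity check on the headline numbers, the stated precision follows directly from the flagged and confirmed counts:

```python
# Precision = manually confirmed true anomalies / all flagged anomalies.
flagged, confirmed = 1573, 642
precision = confirmed / flagged
print(f"{precision:.1%}")   # 40.8%
```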