AI Summary
Traditional A/B testing relies on real user traffic, which makes it inefficient for low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive scenarios. This work proposes a persona-driven AI agent simulation framework that, for the first time, integrates large language model-generated user profiles and behavioral preferences into A/B testing. By simulating user choices between design screenshots, the framework enables rapid, privacy-preserving end-to-end evaluation without live traffic, while supporting early feedback and interpretable analysis. Evaluated on 47 historical experiments, the method achieves an overall accuracy of 67%, rising to 83% in high-confidence cases. Additional experiments show robustness to naming and position biases and indicate that persona modeling improves prediction accuracy.
Abstract
A/B testing is a standard method for validating design decisions, yet its reliance on real user traffic limits iteration speed and makes certain experiments impractical. We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales. Through a formative study with experimentation practitioners, we identified scenarios where traffic constraints hinder testing, including low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive contexts. Our design emphasizes speed, early feedback, actionable rationales, and audience specification. We evaluate SimAB against 47 historical A/B tests with known outcomes, achieving 67% overall accuracy, increasing to 83% for high-confidence cases. Additional experiments show robustness to naming and positional bias and demonstrate accuracy gains from personas. Practitioner feedback suggests that SimAB supports faster evaluation cycles and rapid screening of designs difficult to assess with traditional A/B tests.
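The pipeline described in the abstract (personas deployed as agents that each state a preference, followed by aggregation) can be illustrated with a minimal sketch. This is not SimAB's implementation: the function `simulate_ab`, the `agent` callable, the persona dictionaries, and the use of vote share as a confidence proxy are all hypothetical names and choices made for illustration. In a real system the agent would presumably be a model call that receives the persona and the design screenshots; here it is a deterministic stand-in so the example runs on its own.

```python
import random
from collections import Counter


def simulate_ab(personas, variants, agent, seed=0):
    """Collect one preference per persona and aggregate by majority vote.

    `agent` is a hypothetical callable (persona, ordered_variants) -> chosen
    variant. Returns (winner, vote_share, votes). The vote share is only an
    illustrative confidence proxy, not the confidence measure from the paper.
    """
    if not personas:
        raise ValueError("at least one persona is required")
    rng = random.Random(seed)
    votes = Counter()
    for persona in personas:
        # Shuffle the presentation order per persona so that no variant
        # is always shown first (a simple guard against positional bias).
        order = list(variants)
        rng.shuffle(order)
        choice = agent(persona, order)
        if choice not in variants:
            raise ValueError(f"agent returned unknown variant: {choice!r}")
        votes[choice] += 1
    # On a tie, most_common keeps the variant that received a vote first.
    winner, top = votes.most_common(1)[0]
    return winner, top / len(personas), dict(votes)


def toy_agent(persona, ordered_variants):
    """Deterministic stand-in for a persona-conditioned model call."""
    return "B" if persona["price_sensitive"] else "A"


# Hypothetical personas; a real run would generate these for a target audience.
personas = [
    {"name": "student", "price_sensitive": True},
    {"name": "freelancer", "price_sensitive": True},
    {"name": "retiree", "price_sensitive": True},
    {"name": "executive", "price_sensitive": False},
]

winner, share, votes = simulate_ab(personas, ["A", "B"], toy_agent)
print(winner, share, votes)
```

Under these assumptions, comparing `winner` against the known outcome of each historical test and averaging the matches would give an accuracy figure of the kind the abstract reports, and restricting that average to runs with a high vote share would correspond loosely to the "high-confidence cases".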