When prompt perturbations break your A/B test: A valid statistical test for generative surveying

📅 2026-05-25

🤖 AI Summary

This study addresses the sensitivity of large language models to semantically equivalent prompt perturbations in generative surveys, which undermines the reliability of conventional A/B testing. The work introduces a statistical model for generative surveying that includes realistic perturbation structure and proposes a permutation test that is valid under that model. It shows that standard nonparametric tests, such as the sign test and the Wilcoxon signed-rank test, are invalid under this model and formally characterizes the conditions under which they fail. Applying the framework to a simple generative surveying problem, the paper estimates the relevant parameters, characterizes the power of the permutation test under realistic conditions, and offers practical guidance on allocating an experimental budget across personas, perturbations, and replicates. It also shows that both the magnitude and the direction of the estimated effect depend on the choice of model, even within the same model family.

📝 Abstract

Generative surveying, where collections of LLM-based personas provide feedback on messages, has emerged as a cheap and scalable alternative to traditional market research. However, LLMs are sensitive to small variations in prompt design, so conclusions drawn from generative surveys may depend on arbitrary phrasing choices. Controlling for this sensitivity requires including semantically equivalent perturbations in the analysis. In this paper, we show that standard hypothesis tests, including the sign test and the Wilcoxon signed-rank test, are invalid under a statistical model for generative surveying that includes realistic perturbation structure. We propose a permutation test that is valid under this model and formally characterize the conditions under which standard tests fail. Applying our framework to a simple generative surveying problem, we estimate relevant parameters, characterize the power of the permutation test under realistic conditions, and provide practical guidance on budget allocation across personas, perturbations, and replicates. Finally, we show that both the magnitude and direction of the estimated effect are sensitive to the choice of model, even within the same model family.
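The abstract does not spell out the test itself, so the following is only an illustrative sketch of how a permutation test can respect perturbation structure, not the paper's procedure. It assumes (hypothetically) that each observation is an A-minus-B score difference indexed by perturbation, persona, and replicate, and that all observations sharing a perturbation are correlated. Under that assumption, the A/B labels are swapped jointly for a whole perturbation, which amounts to flipping the sign of that perturbation's mean difference. The function name `cluster_sign_flip_test` and the array layout are inventions for this example.

```python
import numpy as np


def cluster_sign_flip_test(diffs, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test at the perturbation level.

    diffs: array of shape (n_perturbations, n_personas, n_replicates)
           holding A-minus-B score differences (assumed layout).
    Returns (observed mean difference, Monte Carlo p-value).
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)

    # Collapse personas and replicates: one mean difference per perturbation.
    cluster_means = diffs.reshape(diffs.shape[0], -1).mean(axis=1)
    observed = cluster_means.mean()

    # Swap A/B labels jointly within each perturbation: one random sign per cluster.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, cluster_means.size))
    null_stats = (signs * cluster_means).mean(axis=1)

    # Small tolerance guards against floating-point ties with the observed value.
    extreme = np.sum(np.abs(null_stats) >= abs(observed) - 1e-12)
    p_value = (1 + extreme) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    # Synthetic data: 10 perturbations, 5 personas, 3 replicates, no true effect,
    # but a shared random shift per perturbation (assumed structure).
    rng = np.random.default_rng(42)
    shift = rng.standard_normal((10, 1, 1))
    noise = 0.5 * rng.standard_normal((10, 5, 3))
    print(cluster_sign_flip_test(shift + noise))
```

The design choice here is that the number of perturbations, not the total number of observations, determines how small the p-value can get. A test that treats every persona-by-replicate difference as independent would ignore the shared shift within a perturbation, which is the kind of structure the abstract says invalidates the sign test and the Wilcoxon signed-rank test.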

Problem

Research questions and friction points this paper is trying to address.

generative surveying

prompt perturbations

statistical validity

LLM sensitivity

A/B testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

generative surveying

prompt perturbations

permutation test