When prompt perturbations break your A/B test: A valid statistical test for generative surveying

2026-05-25
AI Summary
This study addresses the high sensitivity of large language models to semantically equivalent prompt perturbations in generative survey settings, which undermines the reliability of conventional A/B testing. The work formalizes the structure of such perturbations and introduces a permutation test tailored to that structure. It further characterizes the conditions under which standard nonparametric tests, such as the sign test and the Wilcoxon signed-rank test, fail in this context. Using a simple generative surveying problem, the paper estimates the relevant parameters and characterizes the power of the permutation test under realistic conditions. Practical guidance is provided for allocating experimental budgets across personas, perturbations, and replicates. Finally, both the magnitude and the direction of the estimated effect are shown to be sensitive to the choice of model, even within the same model family.
๐Ÿ“ Abstract
Generative surveying -- where collections of LLM-based personas provide feedback on messages -- has emerged as a cheap and scalable alternative to traditional market research. However, LLMs are sensitive to small variations in prompt design and conclusions drawn from generative surveys may depend on arbitrary phrasing choices. Controlling for this sensitivity requires including semantically equivalent perturbations in the analysis. In this paper, we show that standard hypothesis tests, including the sign test and Wilcoxon signed-rank test, are invalid under a statistical model for generative surveying that includes realistic perturbation structure. We propose a permutation test that is valid under this model and formally characterize the conditions under which standard tests fail. Applying our framework to a simple generative surveying problem, we estimate relevant parameters, characterize the power of the permutation test under realistic conditions, and provide practical guidance on budget allocation across personas, perturbations, and replicates. Finally, we show that both the magnitude and direction of the estimated effect are sensitive to the choice of model, even within the same model family.
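To make the idea of a permutation test over perturbations concrete, the following is a minimal sketch of a generic cluster-level sign-flip test. It is not the procedure from the paper, whose statistical model is not given in the abstract. It assumes paired score differences (message A minus message B) arranged as a hypothetical array of shape (personas, perturbations, replicates), and it assumes that under the null hypothesis the per-perturbation mean differences are symmetric about zero. The function name `sign_flip_test` and all of its parameters are illustrative.

```python
import numpy as np


def sign_flip_test(diffs, cluster_axis=1, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on cluster-level mean differences.

    diffs: array of A-minus-B scores, assumed shape
           (personas, perturbations, replicates).
    cluster_axis: non-negative index of the axis treated as the
           exchangeable unit (assumption: 1, the perturbation axis).
    """
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)

    # Average over every axis except the cluster axis: one value per cluster.
    other_axes = tuple(a for a in range(diffs.ndim) if a != cluster_axis)
    cluster_means = diffs.mean(axis=other_axes)
    observed = cluster_means.mean()

    # Draw one random sign per cluster for each permutation.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, cluster_means.size))
    null_stats = (signs * cluster_means).mean(axis=1)

    # The add-one correction keeps the Monte Carlo p-value strictly positive.
    hits = np.sum(np.abs(null_stats) >= abs(observed))
    p_value = (1 + hits) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    # Simulated data for illustration only: 20 personas, 8 perturbations,
    # 5 replicates, with a shared random shift per perturbation.
    rng = np.random.default_rng(1)
    perturbation_shift = rng.normal(0.0, 0.5, size=(1, 8, 1))
    noise = rng.normal(0.0, 1.0, size=(20, 8, 5))
    effect, p = sign_flip_test(perturbation_shift + noise)
    print(f"mean difference = {effect:.3f}, p-value = {p:.3f}")
```

Flipping signs at the perturbation level, rather than for each individual response, keeps responses that share a perturbation together. This is one simple way to respect correlation induced by shared phrasing. Whether it matches the exchangeability structure of a given study is an assumption the analyst must check.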
Problem

Research questions and friction points this paper is trying to address.

generative surveying
prompt perturbations
statistical validity
LLM sensitivity
A/B testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

generative surveying
prompt perturbations
permutation test
statistical validity
LLM sensitivity