🤖 AI Summary
Large language models (LLMs) give inconsistent and unreliable responses to questionnaire-style prompts, which hinders their use in survey simulation and data annotation. Method: We propose a modular, open-source framework that enables no-code construction of computer-simulated surveys and annotation experiments. It integrates structured prompt engineering, systematic prompt perturbation design, and multidimensional consistency evaluation, comparing more than 40 million simulated responses against human answers. Contribution/Results: We present the first empirical evidence that questionnaire structural features and response generation strategies critically affect response consistency. By optimizing prompt presentation and output constraints, we significantly reduce computational cost while improving the alignment between LLM outputs and authentic human responses. The framework is publicly released and designed for non-technical users, improving the reproducibility and scalability of virtual surveys and annotation tasks.
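The summary does not show QSTN's actual API, so the following is a minimal, hypothetical sketch of the kind of systematic prompt perturbation described above, written in plain Python. All names here (`QuestionItem`, `build_prompt`, `perturbations`) are assumptions for illustration and are not part of QSTN.

```python
# Hypothetical sketch of questionnaire prompt perturbation (not QSTN's real API):
# vary instruction wording and answer-option order for one survey item.
import itertools
import random
from dataclasses import dataclass

@dataclass
class QuestionItem:
    """A single survey question with a fixed set of answer options."""
    text: str
    options: list[str]

def build_prompt(item: QuestionItem, instruction: str) -> str:
    """Render a questionnaire-style prompt: instruction, question, labeled options."""
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item.options))
    return f"{instruction}\n\nQuestion: {item.text}\n{labeled}\nAnswer:"

def perturbations(item: QuestionItem, instructions: list[str],
                  n_orders: int = 3, seed: int = 0):
    """Yield systematically perturbed prompts for the same underlying question."""
    rng = random.Random(seed)
    orders = [item.options[:]]
    for _ in range(n_orders - 1):
        shuffled = item.options[:]
        rng.shuffle(shuffled)
        orders.append(shuffled)
    for instruction, options in itertools.product(instructions, orders):
        yield build_prompt(QuestionItem(item.text, options), instruction)

item = QuestionItem(
    "How satisfied are you with your job?",
    ["Very satisfied", "Somewhat satisfied", "Somewhat dissatisfied", "Very dissatisfied"],
)
for prompt in perturbations(item, ["Answer with a single letter.",
                                   "Pick the option that fits best."]):
    print(prompt, end="\n---\n")
```

Sending each perturbed variant to the model and comparing the resulting answers is one simple way to probe the response-consistency effects the summary reports.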
📝 Abstract
We introduce QSTN, an open-source Python framework for systematically generating responses to questionnaire-style prompts, supporting in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation ($>40$ million survey responses) shows that question structure and response generation methods have a significant impact on how well generated survey responses align with human answers, and that well-aligned responses can be obtained at a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust LLM experiments without programming knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
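To make the alignment claim concrete, here is a hedged sketch of one possible consistency check: comparing the distribution of model answers across perturbed prompts with a human answer distribution using total variation distance. The metric choice, the function names, and the toy data are all assumptions; the paper's actual multidimensional evaluation may differ.

```python
# Hypothetical consistency check (illustrative only; not QSTN's evaluation code):
# compare model answers across perturbed prompts with human survey answers.
from collections import Counter

def answer_distribution(answers: list[str], options: list[str]) -> dict[str, float]:
    """Normalize answer counts over the fixed option set."""
    counts = Counter(answers)
    total = sum(counts[o] for o in options) or 1
    return {o: counts[o] / total for o in options}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """TV distance in [0, 1]; 0 means identical answer distributions."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

options = ["A", "B", "C", "D"]
model_answers = ["A", "A", "B", "A", "C", "A"]   # one answer per perturbed prompt
human_answers = ["A", "B", "A", "A", "D", "B"]   # answers from a human survey panel

p = answer_distribution(model_answers, options)
q = answer_distribution(human_answers, options)
print(f"TV distance: {total_variation(p, q):.3f}")  # lower = closer to human answers
```

Under this reading, "alignment with human answers" is a distance between answer distributions, so cheaper generation strategies can be compared directly against more expensive ones on the same scale.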