🤖 AI Summary
Existing survey-analysis tools integrate poorly with large language models (LLMs), and there is little evidence-based guidance on how to structure questionnaire data for them. Method: We propose QASU, a benchmark designed to evaluate LLMs' ability to understand questionnaire structure, comprising six structured reasoning tasks, including answer retrieval, respondent statistics, and multi-hop inference. Through systematic experiments, we quantitatively assess the impact of six data serialization formats and multiple prompting strategies, and we introduce a lightweight "self-augmented prompting" technique that explicitly injects structural knowledge about the questionnaire. Contribution/Results: The best format-prompt combination improves accuracy by up to 8.8 percentage points over suboptimal formats. On specific tasks, self-augmented prompting yields average gains of 3-4 percentage points and makes LLM performance more robust across questionnaire formats.
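To make the setup concrete, here is a minimal sketch of the two ideas the summary mentions: serializing the same questionnaire in different formats and building a "self-augmented" prompt that first elicits a structural summary and then reuses it as a hint. The data, format names, prompt wording, and the `call_llm` helper are illustrative assumptions, not the paper's actual specification.

```python
# Illustrative sketch only: formats, prompt wording, and helpers are assumptions.
import json

# Hypothetical questionnaire: questions crossed with respondent rows.
questionnaire = {
    "questions": {
        "Q1": "How satisfied are you with the product?",
        "Q2": "Would you recommend it to a friend?",
    },
    "respondents": [
        {"id": "R1", "Q1": "Very satisfied", "Q2": "Yes"},
        {"id": "R2", "Q1": "Neutral", "Q2": "No"},
    ],
}

def serialize(data: dict, fmt: str = "json") -> str:
    """Serialize the questionnaire in one of several candidate formats."""
    if fmt == "json":
        return json.dumps(data, indent=2)
    if fmt == "markdown":
        # Flatten respondents into a Markdown table.
        qids = list(data["questions"])
        header = "| id | " + " | ".join(qids) + " |"
        sep = "|" + "---|" * (len(qids) + 1)
        rows = [
            "| " + r["id"] + " | " + " | ".join(r[q] for q in qids) + " |"
            for r in data["respondents"]
        ]
        return "\n".join([header, sep, *rows])
    raise ValueError(f"unknown format: {fmt}")

def self_augmented_prompt(data: dict, question: str, fmt: str = "json") -> list[str]:
    """Two-pass prompting: elicit a structural summary of the serialized data,
    then append that summary as a hint when asking the actual question."""
    doc = serialize(data, fmt)
    pass1 = f"Describe the structure of the following questionnaire data:\n\n{doc}"
    # structure_hint = call_llm(pass1)  # call_llm is a hypothetical LLM client
    structure_hint = "<structural summary returned by the model in pass 1>"
    pass2 = (
        f"{doc}\n\nStructural hint: {structure_hint}\n\n"
        f"Question: {question}\nAnswer concisely."
    )
    return [pass1, pass2]
```

Under these assumptions, a respondent-count query would run the first prompt once per serialized questionnaire and feed its output into the second; the benchmark's contribution is measuring how much the format choice and this extra structural hint actually change accuracy.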
📝 Abstract
Millions of people take surveys every day, from market polls and academic studies to medical questionnaires and customer feedback forms. These datasets capture valuable insights, but their scale and structure pose a unique challenge for large language models (LLMs), which otherwise excel at few-shot reasoning over open-ended text. Their ability to process questionnaire data, that is, lists of questions crossed with hundreds of respondent rows, remains underexplored. Current retrieval and survey analysis tools (e.g., Qualtrics, SPSS, REDCap) are typically designed with humans in the workflow, which limits the integration of such data with LLM- and AI-powered automation. This gap leaves scientists, surveyors, and everyday users without evidence-based guidance on how best to represent questionnaires for LLM consumption. We address this by introducing QASU (Questionnaire Analysis and Structural Understanding), a benchmark that probes six structural skills, including answer lookup, respondent count, and multi-hop inference, across six serialization formats and multiple prompt strategies. Experiments on contemporary LLMs show that choosing an effective format and prompt combination can improve accuracy by up to 8.8 percentage points compared to suboptimal formats. For specific tasks, carefully adding a lightweight structural hint through self-augmented prompting can yield further improvements of 3-4 percentage points on average. By systematically isolating format and prompting effects, our open-source benchmark offers a simple yet versatile foundation for advancing both research and real-world practice in LLM-based questionnaire analysis.