Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates alignment biases in large language models (LLMs) when simulating human social behaviors—such as opinions and decisions—in multiple-choice surveys, particularly within economics and marketing contexts. We propose a statistical hypothesis-testing framework for quantifying LLM–human behavioral alignment, integrating multi-group proportion comparisons, confidence interval analysis, and behavioral consistency metrics to detect systematic deviations across demographic dimensions—including race, age, and income—on contentious topics. Our key contribution is the application of rigorous statistical inference to evaluate LLM fidelity in modeling human social behavior. Applying the framework to a popular language model, we find statistically significant misalignment with real-world population distributions on sensitive issues, indicating that current LLMs lack sufficient representational validity for high-fidelity social science research.
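To make the multi-group proportion comparison concrete, here is a minimal sketch of testing one question's LLM-simulated answer distribution against a published human benchmark. The counts, proportions, and the choice of a chi-square goodness-of-fit test are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code): a chi-square goodness-of-fit
# test asking whether an LLM's simulated answer distribution for one
# multiple-choice question matches the human survey distribution.
import numpy as np
from scipy import stats

# Hypothetical counts for a 4-option question.
llm_counts = np.array([312, 145, 401, 142])        # LLM-simulated respondents
human_props = np.array([0.25, 0.20, 0.35, 0.20])   # published survey proportions

expected = human_props * llm_counts.sum()
chi2, p_value = stats.chisquare(f_obs=llm_counts, f_exp=expected)

# H0: the LLM's choice distribution equals the human distribution.
# A small p-value flags statistically significant misalignment.
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
```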

📝 Abstract
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model for simulating people's opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
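The confidence-interval analysis mentioned above could look like the following sketch: a two-proportion z-test with a Wald interval for the gap between the LLM's and a human subgroup's rate of choosing one option. All counts are hypothetical, and the Wald interval is one reasonable construction, not necessarily the authors' choice.

```python
# Illustrative sketch, assuming hypothetical counts: compare the LLM's and a
# matched human subgroup's rate of picking one option (e.g., "agree").
import math
from scipy import stats

llm_yes, llm_n = 430, 1000        # LLM-simulated subgroup
human_yes, human_n = 312, 1000    # matched human survey subgroup

p1, p2 = llm_yes / llm_n, human_yes / human_n
pooled = (llm_yes + human_yes) / (llm_n + human_n)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / llm_n + 1 / human_n))
z = (p1 - p2) / se_pooled
p_value = 2 * stats.norm.sf(abs(z))   # two-sided test of H0: p1 == p2

# 95% Wald CI for the difference in proportions (unpooled standard error).
se = math.sqrt(p1 * (1 - p1) / llm_n + p2 * (1 - p2) / human_n)
lo, hi = (p1 - p2) - 1.96 * se, (p1 - p2) + 1.96 * se
print(f"z = {z:.2f}, p = {p_value:.3g}, 95% CI for diff = [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the LLM's response rate for that option differs from the human subgroup's beyond sampling noise.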
Problem

Research questions and friction points this paper is trying to address.

Quantify misalignment between LLM-simulated and human behaviors
Assess LLM suitability for simulating human decision-making
Evaluate LLM alignment with diverse sub-populations in surveys
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypothesis testing for LLM-human misalignment quantification
Quantitative framework for multiple-choice behavior assessment
Evaluating LLM suitability across diverse demographics
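As a rough illustration of the demographic evaluation listed above, the sketch below repeats the per-question test across several subgroups and controls the family-wise error rate. Subgroup labels, counts, and the Holm correction are assumptions for illustration, not details from the paper.

```python
# A hedged sketch of per-subgroup misalignment testing with
# multiple-comparison control; all data here is hypothetical.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# (llm_counts, human_proportions) per demographic subgroup,
# for one 3-option contentious question.
subgroups = {
    "age_18_29":   (np.array([210,  60, 130]), np.array([0.40, 0.25, 0.35])),
    "age_30_49":   (np.array([180, 110, 110]), np.array([0.35, 0.30, 0.35])),
    "age_50_plus": (np.array([150, 160,  90]), np.array([0.30, 0.40, 0.30])),
}

pvals = []
for name, (llm_counts, human_props) in subgroups.items():
    expected = human_props * llm_counts.sum()
    _, p = stats.chisquare(f_obs=llm_counts, f_exp=expected)
    pvals.append(p)

# Holm correction keeps the family-wise error rate at 5% across subgroups.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for name, p, r in zip(subgroups, p_adj, reject):
    print(f"{name}: adjusted p = {p:.3g}, misaligned = {bool(r)}")
```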