Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

📅 2026-05-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the difficulty current large language models have in simulating authentic consumer responses in high-context Chinese consumption scenarios. To this end, the authors introduce ConsumerSimBench, described as the first fine-grained, auditable evaluation benchmark focused on real-world reaction patterns. It uses 1,553 Chinese social media topics and 23,122 rule-audited response criteria to reformulate consumer simulation as a verifiable yes/no judgment task. Decomposing each task into pointwise decisions raises three-judge agreement from 65.8% to 92.1%, which improves evaluation reliability. Experiments show that even the strongest of the 13 evaluated models, Gemini-3.1-Pro, covers only 47.8% of genuine consumer reactions, substantially below human performance, and that a generate-and-reflect multi-agent pipeline improves MiMo-V2.5-Pro's coverage from 32.9% to 37.6% on a subset. Together these results underscore a pronounced limitation in state-of-the-art models' capacity for socially grounded consumer intuition.
📝 Abstract
LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
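The pointwise scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's actual implementation: the function names, the data layout (criterion ID mapped to a list of per-judge yes/no votes), the use of a strict majority to settle each criterion, and the reading of "agreement" as unanimity among judges are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical layout: criterion ID -> yes/no votes from each judge.
Judgments = Dict[str, List[bool]]


def majority_vote(votes: List[bool]) -> bool:
    """Return True when strictly more than half of the judges voted yes."""
    return sum(votes) * 2 > len(votes)


def coverage(judgments: Judgments) -> float:
    """Fraction of criteria whose majority label is yes (assumed metric)."""
    if not judgments:
        return 0.0
    covered = sum(majority_vote(votes) for votes in judgments.values())
    return covered / len(judgments)


def unanimous_agreement(judgments: Judgments) -> float:
    """Fraction of criteria on which every judge gave the same answer.

    Unanimity is an assumed definition of judge agreement, chosen here
    only because it is simple to state.
    """
    if not judgments:
        return 0.0
    unanimous = sum(len(set(votes)) == 1 for votes in judgments.values())
    return unanimous / len(judgments)


# Made-up votes for four criteria, three judges each.
example: Judgments = {
    "c1": [True, True, True],
    "c2": [True, False, True],
    "c3": [False, False, False],
    "c4": [False, True, False],
}

if __name__ == "__main__":
    print(f"coverage: {coverage(example):.2f}")
    print(f"unanimous agreement: {unanimous_agreement(example):.2f}")
```

Because each criterion is a separate yes/no decision, a disputed score can be traced back to the specific reaction point and the specific judge votes behind it, which is what makes this style of evaluation auditable.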
Problem

Research questions and friction points this paper is trying to address.

LLMs
consumer reaction
public discourse
reaction reconstruction
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConsumerSimBench
reaction reconstruction
socially grounded evaluation
multi-agent reasoning
public discourse simulation