🤖 AI Summary
This study addresses the difficulty current large language models have in simulating authentic consumer responses in high-context Chinese consumption scenarios. To this end, we introduce ConsumerSimBench, the first fine-grained, auditable evaluation benchmark focused on real-world reaction patterns. It draws on 1,553 Chinese social media topics and 23,122 rule-audited response criteria to reformulate consumer simulation as a set of verifiable yes/no judgments. This decomposed task design raises three-judge agreement from 65.8% to 92.1%, and pointwise judge decisions agree with human-majority labels in 98.4% of cases. Experiments reveal that even the strongest of the 13 evaluated models, Gemini-3.1-Pro, captures only 47.8% of genuine consumer reactions, substantially below human performance. A generate-and-reflect multi-agent pipeline improves MiMo-V2.5-Pro's coverage from 32.9% to 37.6% on a subset, underscoring a pronounced limitation in state-of-the-art models' capacity for socially grounded consumer intuition.
📝 Abstract
LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
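To make the pointwise scoring idea concrete, the following is a minimal sketch of how per-criterion yes/no judge votes could be aggregated into a coverage score and an agreement rate. The abstract does not give the exact definitions, so this sketch assumes that a criterion counts as covered when a strict majority of judges answer yes, and that "agreement" means all judges gave the same answer. The function names and criterion labels are hypothetical and serve only as illustration.

```python
from typing import Dict, List


def majority_vote(votes: List[bool]) -> bool:
    """Return True when strictly more than half of the judges answered yes."""
    return sum(votes) * 2 > len(votes)


def coverage(judgments: Dict[str, List[bool]]) -> float:
    """Fraction of criteria whose majority decision is yes (assumed definition)."""
    if not judgments:
        return 0.0
    covered = sum(majority_vote(votes) for votes in judgments.values())
    return covered / len(judgments)


def agreement_rate(judgments: Dict[str, List[bool]]) -> float:
    """Fraction of criteria on which every judge gave the same answer (assumed definition)."""
    if not judgments:
        return 0.0
    unanimous = sum(len(set(votes)) == 1 for votes in judgments.values())
    return unanimous / len(judgments)


# Hypothetical criteria for one topic, each judged by three judges.
judgments = {
    "mentions_price_complaint": [True, True, True],
    "compares_to_rival_brand": [True, False, True],
    "jokes_about_packaging": [False, False, False],
    "questions_authenticity": [False, True, False],
}

print(f"coverage:  {coverage(judgments):.2f}")
print(f"agreement: {agreement_rate(judgments):.2f}")
```

Because each criterion is a single binary decision, disagreements between judges can be traced to a specific reaction point, which is harder to do with one holistic preference score over the whole generation.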