🤖 AI Summary
Existing LLM evaluation benchmarks do not systematically assess the complex, dynamic, multi-source, and often conflicting intents that characterize real-world consumer interactions, such as interwoven opinions, divergent goals, implicit assumptions, and affective biases.
Method: We introduce CUIBench, the first dynamic, real-time benchmark for Consumer Intent Understanding, featuring continuous updates and automated data governance to prevent data contamination. It formalizes a nonlinear, multi-perspective public-discourse understanding task that requires models to fuse multi-source signals, reason over inconsistencies, and track evolving context. A fully automated data pipeline, spanning web-scale collection, filtering, and refinement, produces a high-diversity, high-fidelity evaluation dataset.
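The pipeline stages are named but not specified at the implementation level. As a minimal, hedged sketch of what collection, filtering, and refinement with contamination prevention could look like, the Python below gates posts on a training-cutoff timestamp and removes exact duplicates; every name here (`RawPost`, `collect`, `curate`, the cutoff policy) is an illustrative assumption, not CUIBench's actual implementation.

```python
# Sketch only: assumed stand-in for the paper's automated curation pipeline.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass
class RawPost:
    source: str           # e.g. a review site, forum, or Q&A thread
    posted_at: datetime
    text: str


def collect(feeds: Iterable[Iterable[RawPost]]) -> list[RawPost]:
    """Collection: merge posts streamed from every configured source."""
    return [post for feed in feeds for post in feed]


def curate(posts: list[RawPost], cutoff: datetime, min_chars: int = 40) -> list[RawPost]:
    """Filtering + refinement: drop posts that predate the contamination
    cutoff (so they cannot appear in a model's training data), posts too
    short to carry an identifiable intent, and exact duplicates."""
    seen: set[str] = set()
    kept: list[RawPost] = []
    for p in posts:
        key = " ".join(p.text.lower().split())  # normalize case and whitespace
        if p.posted_at > cutoff and len(p.text) >= min_chars and key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


if __name__ == "__main__":
    cutoff = datetime(2024, 6, 1, tzinfo=timezone.utc)
    feed = [[
        RawPost("forum", datetime(2024, 7, 2, tzinfo=timezone.utc),
                "Battery drains overnight even in standby; is a replacement worth it?"),
        RawPost("forum", datetime(2023, 1, 5, tzinfo=timezone.utc),
                "Old thread from before the cutoff, should be filtered out."),
    ]]
    print(curate(collect(feed), cutoff))  # keeps only the post-cutoff post
```

A production pipeline would add source-specific scrapers, quality scoring, and near-duplicate detection; a timestamp gate is simply the most direct contamination guard consistent with a live, continuously updated benchmark.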
Contribution/Results: CUIBench is the largest and most comprehensive benchmark for consumer intent understanding to date, covering unprecedented breadth in intent dimensions and real-world complexity. Empirical evaluation on CUIBench reveals substantial room for improvement in LLMs' ability to parse nuanced, context-sensitive, and conflicting consumer intents.
📝 Abstract
Understanding human intent is a complex, high-level task for large language models (LLMs), requiring analytical reasoning, contextual interpretation, dynamic information aggregation, and decision-making under uncertainty. Real-world public discussions, such as consumer product discussions, are rarely linear, nor do they involve a single user. Instead, they are characterized by interwoven and often conflicting perspectives, divergent concerns, goals, and emotional tendencies, as well as implicit assumptions and background knowledge about usage scenarios. To accurately understand such public intent, an LLM must go beyond parsing individual sentences; it must integrate multi-source signals, reason over inconsistencies, and adapt to evolving discourse, similar to how experts in fields like politics, economics, or finance approach complex, uncertain environments. Despite the importance of this capability, no large-scale benchmark currently exists for evaluating LLMs on real-world human intent understanding, primarily due to the challenges of collecting real-world public discussion data and constructing a robust evaluation pipeline. To bridge this gap, we introduce CUIBench, the first dynamic, live evaluation benchmark specifically designed for intent understanding, particularly in the consumer domain. CUIBench is the largest and most diverse benchmark of its kind, supporting real-time updates while preventing data contamination through an automated curation pipeline.
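The abstract describes what a model must do (fuse multi-source signals, reconcile inconsistencies) without publishing a data schema. The sketch below is one hypothetical shape for a single evaluation instance: a multi-speaker thread with opposing stances plus gold intent labels. All field names and the stance label set are assumptions made for illustration, not the paper's format.

```python
# Illustrative only: an assumed schema for one intent-understanding instance.
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    stance: str   # assumed label set: "pro" | "con" | "neutral"
    text: str


@dataclass
class IntentItem:
    thread: list[Utterance]          # one multi-speaker discussion
    gold_intents: list[str]          # intents a model should recover
    implicit_assumptions: list[str]  # background knowledge the thread presumes


def conflicting_pairs(item: IntentItem) -> list[tuple[Utterance, Utterance]]:
    """Enumerate utterance pairs with opposing stances: the inconsistencies
    a model must reconcile rather than average away."""
    return [
        (a, b)
        for i, a in enumerate(item.thread)
        for b in item.thread[i + 1:]
        if {a.stance, b.stance} == {"pro", "con"}
    ]


if __name__ == "__main__":
    item = IntentItem(
        thread=[
            Utterance("u1", "pro", "The camera alone justifies the upgrade."),
            Utterance("u2", "con", "Camera is fine, but battery life is a dealbreaker."),
        ],
        gold_intents=["weigh camera quality against battery life before upgrading"],
        implicit_assumptions=["both users already own the previous model"],
    )
    print(len(conflicting_pairs(item)))  # -> 1
```

Making disagreements explicit, as `conflicting_pairs` does, is one way an evaluation could check that a model's answer acknowledges both sides of a conflict instead of collapsing them into a single averaged opinion.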