🤖 AI Summary
This work addresses the limited capability of large language models (LLMs) in understanding collective intent—specifically, their difficulty in extracting consensus, resolving contradictions, and inferring latent trends from multi-source public discourse, despite strong performance on individual instructions. To bridge this gap, the authors introduce COIN-BENCH, the first dynamic evaluation benchmark for collective intent understanding, featuring a hierarchical cognitive architecture (COIN-TREE) and a retrieval-augmented verification mechanism (COIN-RAG), alongside a hybrid assessment framework combining rule-based metrics and LLM-as-Judge evaluation. Benchmarking 20 state-of-the-art LLMs reveals that current models largely succeed only at surface-level aggregation and struggle with deep synthesis of complex collective intentions, underscoring the need to evolve beyond mere instruction-following toward expert-level analytical agency.
📝 Abstract
Understanding human intent is a high-level cognitive challenge for Large Language Models (LLMs), requiring sophisticated reasoning over noisy, conflicting, and non-linear discourse. While LLMs excel at following individual instructions, their ability to distill Collective Intent (the process of extracting consensus, resolving contradictions, and inferring latent trends from multi-source public discussions) remains largely unexplored. To bridge this gap, we introduce COIN-BENCH, a dynamic, live-updating benchmark grounded in real-world data and specifically designed to evaluate LLMs on collective intent understanding within the consumer domain. Unlike traditional benchmarks that focus on transactional outcomes, COIN-BENCH operationalizes intent as a hierarchical cognitive structure, ranging from explicit scenarios to deep causal reasoning. We implement a robust evaluation pipeline that combines a rule-based method with an LLM-as-Judge approach. This framework incorporates COIN-TREE for hierarchical cognitive structuring and retrieval-augmented verification (COIN-RAG) to ensure expert-level precision in analyzing raw, collective human discussions. An extensive evaluation of 20 state-of-the-art LLMs across four dimensions (depth, breadth, informativeness, and correctness) reveals that while current models can handle surface-level aggregation, they still struggle with the analytical depth required for complex intent synthesis. COIN-BENCH establishes a new standard for advancing LLMs from passive instruction followers to expert-level analytical agents capable of deciphering the collective voice of the real world. See our project page on COIN-BENCH.
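The abstract describes a hybrid pipeline that blends a rule-based metric with an LLM-as-Judge score over four dimensions (depth, breadth, informativeness, correctness). The paper does not specify how these signals are combined; the sketch below is an illustrative assumption, with a hypothetical F1-style rule metric, a stubbed judge, and an arbitrary 50/50 weighting, not the actual COIN-BENCH implementation.

```python
# Hedged sketch of a hybrid rule-based + LLM-as-Judge scorer.
# All names, weights, and the stubbed judge are illustrative assumptions.

DIMENSIONS = ("depth", "breadth", "informativeness", "correctness")

def rule_based_score(prediction: set, reference: set) -> float:
    """F1 overlap between predicted and reference intent labels (assumed metric)."""
    if not prediction and not reference:
        return 1.0
    tp = len(prediction & reference)
    precision = tp / len(prediction) if prediction else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def judge_scores(answer: str) -> dict:
    """Stub standing in for an LLM judge returning 0-1 scores per dimension.

    A real pipeline would prompt a judge model with a rubric; here we
    return fixed placeholder values so the sketch is runnable.
    """
    return {d: 0.8 for d in DIMENSIONS}

def hybrid_score(prediction: set, reference: set, answer: str,
                 rule_weight: float = 0.5) -> float:
    """Weighted blend of the rule-based metric and the mean judge score."""
    rule = rule_based_score(prediction, reference)
    judge = sum(judge_scores(answer).values()) / len(DIMENSIONS)
    return rule_weight * rule + (1 - rule_weight) * judge
```

In this toy setup, a prediction sharing one of two labels with the reference yields a rule F1 of 0.5, which averages with the stubbed judge score of 0.8 into a hybrid score of 0.65.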