🤖 AI Summary
This work addresses the limited practical utility of state-of-the-art AI models in high-value consumer tasks: shopping, food, gaming, and DIY. To bridge this gap, we introduce the first version of ACE (the AI Consumer Index), a benchmark for evaluating consumer-oriented AI capabilities. We propose dynamic provenance verification, an automated method to assess whether model responses faithfully reflect retrieved web content, and integrate a hybrid human–automated, tiered evaluation framework that tests the factual accuracy and usability of critical elements (e.g., prices, URLs). We maintain a hidden test set of 400 cases and release an open-source development set of 80 cases. Evaluation of 10 frontier models with web search enabled reveals a maximum overall score of only 56.1%; in Shopping, even the top model scores below 50%. Hallucinated prices and non-working links indicate that current AI systems fall substantially short of real-world consumer needs.
📝 Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a devset under a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Performance varies across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
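To make the idea of a grounding check concrete, the following is a minimal sketch and not the ACE grader itself, whose implementation the abstract does not describe. It assumes that the "relevant parts" of a response are prices and URLs, that retrieved sources are available as plain text, and that a claim counts as grounded only if it appears verbatim in a source. The names `Source`, `extract_claims`, and `grounding_report` are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for two kinds of critical elements: prices and links.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")


@dataclass
class Source:
    """One retrieved web page (assumed shape: its URL plus extracted text)."""
    url: str
    text: str


def extract_claims(response: str) -> dict:
    """Pull prices and URLs out of a model response."""
    return {
        "prices": PRICE_RE.findall(response),
        # Trailing sentence punctuation is not part of the link.
        "urls": [u.rstrip(".,;") for u in URL_RE.findall(response)],
    }


def grounding_report(response: str, sources: list) -> dict:
    """Map each extracted claim to True if some retrieved source supports it."""
    claims = extract_claims(response)
    # Compare whole price tokens so "$19" is not matched inside "$19.99".
    source_prices = set()
    for s in sources:
        source_prices.update(PRICE_RE.findall(s.text))
    source_urls = {s.url for s in sources}

    report = {}
    for price in claims["prices"]:
        report[price] = price in source_prices
    for url in claims["urls"]:
        report[url] = url in source_urls
    return report
```

Under these assumptions, a response that quotes a price or link absent from every retrieved page is flagged as ungrounded. A production grader would likely need fuzzier matching (currency formats, redirects, paraphrased claims), which this sketch deliberately omits.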