The AI Consumer Index (ACE)

📅 2025-12-04
🤖 AI Summary
This work addresses the limited practical utility of state-of-the-art AI models in high-value consumer tasks: shopping, food, gaming, and DIY. To bridge this gap, we introduce ACE, the first dedicated benchmark for evaluating consumer-oriented AI capabilities. We propose dynamic provenance verification, an automated method to assess whether model responses faithfully reflect retrieved web content, and integrate a hybrid human–automated, tiered evaluation framework that rigorously tests the factual accuracy and operational viability of critical elements (e.g., prices, URLs). We release a hidden test set of 400 instances and an open-source development set, filling a critical gap in consumer-grade AI evaluation. A comprehensive evaluation of 10 SOTA models reveals a maximum overall score of only 56.1%; in Shopping, even the top model scores below 50%. Widespread hallucination and non-executable outputs confirm that current AI systems remain substantially inadequate for real-world consumer applications.
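The grounding check at the core of dynamic provenance verification can be illustrated with a minimal sketch. The actual ACE grader is not public; the claim types checked here (prices and URLs), the regexes, and the function name `verify_grounding` are assumptions made purely for illustration:

```python
import re

# Minimal illustration of a grounding check: extract critical claims
# (prices, URLs) from a model response and flag whether each appears
# verbatim in at least one retrieved web source. This is NOT the ACE
# grader, only a sketch of the underlying idea.

PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")   # e.g. "$24.99"
URL_RE = re.compile(r"https?://[^\s)\"]+")    # crude URL matcher

def verify_grounding(response: str, sources: list[str]) -> dict[str, bool]:
    """Map each extracted claim to True if it is supported by the
    retrieved sources, False if it is potentially hallucinated."""
    corpus = "\n".join(sources)
    claims = PRICE_RE.findall(response) + URL_RE.findall(response)
    return {claim: claim in corpus for claim in claims}

report = verify_grounding(
    "The pan costs $24.99, see https://example.com/pan",
    ["Retrieved listing: cast-iron pan, price $24.99"],
)
# The price is grounded in the source; the URL is not.
```

A claim flagged False is not necessarily wrong, only unsupported by the retrieved sources; the paper's tiered framework additionally tests operational viability (e.g., whether links actually resolve), which a verbatim-match sketch like this cannot capture.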

📝 Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a dev set with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
Problem

Research questions and friction points this paper is trying to address.

Assess AI models on consumer task performance
Evaluate grounding in web sources to reduce hallucinations
Measure gap between model capabilities and user needs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AI Consumer Index benchmark for consumer tasks
Uses hidden test cases across four consumer activity domains
Employs novel grading methodology grounding responses in web sources
Julien Benchek (Mercor)
Rohit Shetty (Mercor)
Benjamin Hunsberger (Mercor)
Ajay Arun (Mercor)
Zach Richards (Mercor)
Brendan Foody (Mercor)
Osvald Nitski
Product Manager at Mercor
Machine Learning, Natural Language Processing, Artificial Intelligence
Bertie Vidgen
Oxford, Mercor
Evals, MCP + RAG, Alignment + Safety, Content Moderation