The AI Consumer Index (ACE)

📅 2025-12-04
🤖 AI Summary
This work addresses the limited practical utility of state-of-the-art AI models in high-value consumer tasks: shopping, food, gaming, and DIY. To bridge this gap, we introduce ACE, the first dedicated benchmark for evaluating consumer-oriented AI capabilities. We propose dynamic provenance verification, an automated method to assess whether model responses faithfully reflect retrieved web content, and integrate a hybrid human–automated, tiered evaluation framework that rigorously tests the factual accuracy and operational viability of critical elements (e.g., prices, URLs). We release a hidden test set of 400 instances and an open-source development set, filling a critical gap in consumer-grade AI evaluation. A comprehensive evaluation of 10 SOTA models reveals a maximum overall score of only 56.1%; in Shopping, even the top model scores below 50%. Widespread hallucination and non-executable outputs confirm that current AI systems remain substantially inadequate for real-world consumer applications.
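The grounding check at the core of dynamic provenance verification can be illustrated with a minimal sketch. The actual ACE grader is not public; the claim types checked here (prices and URLs), the regexes, and the function name `verify_grounding` are assumptions made purely for illustration:

```python
import re

# Minimal illustration of a grounding check: extract critical claims
# (prices, URLs) from a model response and flag whether each appears
# verbatim in at least one retrieved web source. This is NOT the ACE
# grader, only a sketch of the underlying idea.

PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")   # e.g. "$24.99"
URL_RE = re.compile(r"https?://[^\s)\"]+")    # crude URL matcher

def verify_grounding(response: str, sources: list[str]) -> dict[str, bool]:
    """Map each extracted claim to True if it is supported by the
    retrieved sources, False if it is potentially hallucinated."""
    corpus = "\n".join(sources)
    claims = PRICE_RE.findall(response) + URL_RE.findall(response)
    return {claim: claim in corpus for claim in claims}

report = verify_grounding(
    "The pan costs $24.99, see https://example.com/pan",
    ["Retrieved listing: cast-iron pan, price $24.99"],
)
# The price is grounded in the source; the URL is not.
```

A claim flagged False is not necessarily wrong, only unsupported by the retrieved sources; the paper's tiered framework additionally tests operational viability (e.g., whether links actually resolve), which a verbatim-match sketch like this cannot capture.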

📝 Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a dev set with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Models differ across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
Problem

Research questions and friction points this paper is trying to address.

Assess AI models on consumer task performance
Evaluate grounding in web sources to reduce hallucinations
Measure gap between model capabilities and user needs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AI Consumer Index benchmark for consumer tasks
Uses hidden test cases across four consumer activity domains
Employs novel grading methodology grounding responses in web sources
Julien Benchek (Mercor)
Rohit Shetty (Mercor)
Benjamin Hunsberger (Mercor)
Ajay Arun (Mercor)
Zach Richards (Mercor)
Brendan Foody (Mercor)
Osvald Nitski
Product Manager at Mercor
Machine Learning, Natural Language Processing, Artificial Intelligence
Bertie Vidgen
Oxford, Mercor
Evals, MCP + RAG, Alignment + Safety, Content Moderation