UQ: Assessing Language Models on Unsolved Questions

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI benchmarks face a tension between difficulty and realism: exam-style benchmarks are often made artificially hard with limited practical relevance, while benchmarks built from real user interactions skew toward easy, high-frequency problems. Method: the paper introduces UQ, a dynamic evaluation testbed grounded in unsolved questions from Stack Exchange, spanning authentic, challenging domains from theoretical computer science and mathematics to science fiction. UQ adopts an asynchronous, unsolved-question-driven evaluation paradigm and a multi-stage curation pipeline that combines rule-based filtering, LLM-based judging, and human review. It further proposes compound validation strategies that exploit the generator-validator gap, backed by community-wide collective verification. Contribution/Results: UQ sharpens the characterization of models' frontier knowledge: the top model passes validation on only 15% of questions, and preliminary human verification has already confirmed correct answers among those that passed.
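
The multi-stage curation pipeline reads as a funnel: cheap rule-based filters first, an LLM judge second, human review last. A minimal sketch follows, assuming hypothetical names and thresholds (`passes_rule_filters`, `llm_judge_keep`, the 365-day cutoff); the paper's actual filters and prompts are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Question:
    title: str
    body: str
    site: str          # e.g. "cstheory.stackexchange.com"
    age_days: int
    num_answers: int

def passes_rule_filters(q: Question) -> bool:
    """Stage 1: rule-based filtering. Keep only questions that have
    stayed unanswered for a long time (thresholds are illustrative)."""
    return q.num_answers == 0 and q.age_days >= 365

def llm_judge_keep(q: Question, judge) -> bool:
    """Stage 2: an LLM judge screens for quality; `judge` is a
    hypothetical callable wrapping a model API and returning text."""
    verdict = judge(
        f"Is this question well-defined and difficult?\n"
        f"{q.title}\n{q.body}\nAnswer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def curate(candidates, judge, human_review):
    """Stage 3: human review of whatever survives the automatic stages."""
    survivors = [q for q in candidates if passes_rule_filters(q)]
    survivors = [q for q in survivors if llm_judge_keep(q, judge)]
    return [q for q in survivors if human_review(q)]
```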

📝 Abstract
Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
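
The abstract's UQ-Validators rest on the generator-validator gap: a model can often judge a candidate answer more reliably than it can produce one. A compound strategy might look like the sketch below; the function names, the majority-vote composition, and the 2/3 threshold are illustrative assumptions, not the paper's exact recipe.

```python
def validate(question: str, answer: str, verifier,
             n_votes: int = 3, threshold: float = 0.67) -> bool:
    """Compound validation sketch: sample several independent verifier
    judgments and accept only on a supermajority. `verifier` is a
    hypothetical callable wrapping an LLM and returning 'yes'/'no' text."""
    prompt = (f"Question:\n{question}\n\nCandidate answer:\n{answer}\n\n"
              "Does the answer fully and correctly resolve the question? "
              "Answer yes or no.")
    votes = [verifier(prompt).strip().lower().startswith("yes")
             for _ in range(n_votes)]
    return sum(votes) / n_votes >= threshold

# Answers that pass are queued for human review rather than
# auto-accepted, matching the paper's validator-assisted screening.
```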
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI models on unsolved real-world questions
Addressing difficulty-realism tension in benchmark design
Assessing reasoning, factuality, and browsing capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsolved questions from Stack Exchange as testbed
Combining rule-based filters, LLM judges, and human review
Open platform with expert verification for solutions