$OneMillion-Bench: How Far are Language Agents from Human Experts?

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks predominantly focus on structured or exam-style tasks, which inadequately assess language agents' long-term reasoning and tool-use capabilities in real-world, high-stakes professional settings. To address this gap, this work introduces an expert-level benchmark spanning law, finance, industry, healthcare, and the natural sciences, comprising 400 complex tasks that require agents to retrieve authoritative sources, reconcile conflicting evidence, apply domain-specific rules, and make binding decisions. The study proposes a four-dimensional expert evaluation protocol, covering factual accuracy, logical coherence, practical feasibility, and professional compliance, that treats the reasoning process and the final output as equally critical evaluation criteria. This approach substantially improves the ability to differentiate agents by the depth of their professional expertise and their readiness for real-world deployment.

📝 Abstract
As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world professional demands. To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios. Unlike prior work, the benchmark requires retrieving authoritative sources, resolving conflicting evidence, applying domain-specific rules, and making constrained decisions, where correctness depends as much on the reasoning process as on the final answer. We adopt a rubric-based evaluation protocol scoring factual accuracy, logical coherence, practical feasibility, and professional compliance, focused on expert-level problems to ensure meaningful differentiation across agents. Overall, $OneMillion-Bench provides a unified testbed for assessing agentic reliability, professional depth, and practical readiness in domain-intensive scenarios.
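The abstract does not specify how the four rubric dimensions are combined into a per-task score. As a rough illustration only, here is a minimal Python sketch of one plausible aggregation; the class name, the 0-10 scale, and the equal-weight average are all assumptions for illustration, not details from the paper:

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    """Hypothetical per-task expert ratings on the four rubric dimensions (0-10 scale assumed)."""
    factual_accuracy: float
    logical_coherence: float
    practical_feasibility: float
    professional_compliance: float

def aggregate(scores: RubricScores, weights=None) -> float:
    """Weighted mean over the four dimensions; equal weights assumed by default."""
    dims = [
        scores.factual_accuracy,
        scores.logical_coherence,
        scores.practical_feasibility,
        scores.professional_compliance,
    ]
    weights = weights or [0.25] * 4
    return sum(w * s for w, s in zip(weights, dims))

# Example: one agent trace as a domain expert might rate it
trace = RubricScores(factual_accuracy=8, logical_coherence=7,
                     practical_feasibility=6, professional_compliance=9)
print(f"Aggregate rubric score: {aggregate(trace):.2f}")  # 7.50
```

A weighted mean is only one choice; because the benchmark treats reasoning and output as equally critical, a conjunctive rule (e.g., requiring a minimum score on every dimension) would also be consistent with the description.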
Problem

Research questions and friction points this paper is trying to address.

language agents
benchmark
professional tasks
real-world scenarios
evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

language agents
professional benchmark
multi-step reasoning
rubric-based evaluation
domain-intensive tasks