BRIDGE: Predicting Human Task Completion Time From Model Performance

πŸ“… 2026-02-06
πŸ€– AI Summary
This work addresses the limitations of existing approaches that rely on costly, noisy, and hard-to-scale human annotations of task completion time, and that lack a principled mechanism linking AI performance to human-interpretable task difficulty. The authors propose BRIDGE, a framework that applies two-parameter logistic item response theory (IRT) to jointly infer latent task difficulty and model ability from multi-benchmark performance data, establishing a log-linear relationship between latent difficulty and human completion time. Without requiring any human annotations, BRIDGE can predict human completion times for new tasks and provides a unified metric for assessing real-world AI capabilities. The experiments independently reproduce METR's exponential scaling law: the length of tasks that frontier models can solve at a 50% success rate doubles roughly every six months, validating the method's effectiveness and predictive power.
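The core of the approach is the standard two-parameter logistic (2PL) IRT model, which gives the probability that a model of ability ΞΈ solves a task of difficulty b with discrimination a. A minimal sketch (the parameter values below are illustrative, not from the paper):

```python
import numpy as np

def p_correct(theta, b, a):
    """Two-parameter logistic IRT: probability that a model with
    ability `theta` solves a task with difficulty `b` and
    discrimination `a`. At theta == b the probability is exactly 0.5."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative values: a fairly able model on a slightly easy task.
print(p_correct(theta=1.0, b=-0.5, a=1.2))  # well above 0.5
```

In BRIDGE, both ΞΈ (per model) and b, a (per task) are jointly estimated from binary success/failure outcomes across benchmarks; the snippet shows only the likelihood's building block.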

πŸ“ Abstract
Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE, a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time. Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and independently reproduce METR's exponential scaling results, with the 50% solvable task horizon doubling approximately every 6 months.
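The abstract's key claim is that latent difficulty is linear in the logarithm of human completion time, so the mapping can be inverted to express a model's 50%-solvable horizon in human task length. A sketch under assumed anchoring coefficients (ALPHA and BETA are placeholders, not fitted values from the paper):

```python
import math

# Hypothetical anchoring coefficients: latent difficulty b is assumed
# linear in log human completion time, b = ALPHA + BETA * log(t).
ALPHA, BETA = 0.0, 1.0

def difficulty_from_time(t_minutes):
    """Map human completion time (minutes) to latent 2PL difficulty b."""
    return ALPHA + BETA * math.log(t_minutes)

def horizon_50(theta):
    """Task length (minutes) a model of ability `theta` solves with
    probability 0.5. In the 2PL model p = 0.5 exactly when theta == b,
    so we invert the log-linear map."""
    return math.exp((theta - ALPHA) / BETA)

# If ability grows by BETA * log(2) over six months, the 50% horizon
# doubles, matching the exponential scaling described above.
theta_now = difficulty_from_time(60.0)        # ability matching a 1-hour task
theta_later = theta_now + BETA * math.log(2)  # six months of progress
print(horizon_50(theta_now), horizon_50(theta_later))  # ~60 and ~120 minutes
```

This inversion is what lets model performance alone, once anchored, be read out as a human time horizon.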
Problem

Research questions and friction points this paper is trying to address.

human task completion time
benchmark evaluation
task difficulty
AI capability assessment
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Item Response Theory
task difficulty estimation
human-AI alignment
model capability scaling
psychometric modeling