Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Pass@k metrics for evaluating reasoning capabilities are susceptible to random guessing at high sampling budgets, particularly in discrete answer spaces (e.g., mathematics), leading to inflated estimates of true reasoning proficiency. To address this, we propose Cover@τ, a metric with an explicit reliability threshold τ: the fraction of problems for which at least a τ proportion of sampled completions is correct. This enables a precise characterization of a model's trustworthy reasoning boundary. Within a reinforcement learning with verifiable rewards (RLVR) framework, we empirically analyze completion distributions on mathematical tasks. Experiments show that Cover@τ effectively exposes models that rely on stochastic sampling, whose performance declines sharply at high τ, and substantially alters the relative rankings of mainstream RLVR algorithms. Cover@τ thus offers a more robust and interpretable paradigm for reasoning evaluation.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for improving Large Language Models on reasoning tasks such as coding, math, or logic. To assess the reasoning boundary (the fraction of problems a model can solve), researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k, the base model usually outperforms them when a very large number of completions is sampled. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the ever-increasing chance of success over many trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems a model can solve such that at least a tau proportion of completions is correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
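The contrast between the two metrics can be sketched in Python. This is a minimal illustration, not the paper's evaluation code: `pass_at_k` is the standard unbiased Pass@k estimator (as popularized by the HumanEval benchmark), and `cover_at_tau` follows the definition given in the abstract; the toy counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k
    completions drawn without replacement from n samples, of which c
    are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cover_at_tau(correct_counts, n, tau):
    """Fraction of problems for which at least a tau proportion of the
    n sampled completions is correct."""
    return sum(c / n >= tau for c in correct_counts) / len(correct_counts)

# Toy example: 4 problems, 100 completions each;
# correct_counts[i] = number of correct completions for problem i.
correct_counts = [90, 60, 5, 0]
n = 100

# A problem solved in only 5/100 completions still yields Pass@64 ~ 0.995,
# but contributes nothing to Cover@0.5: only 2 of 4 problems clear tau=0.5.
print(pass_at_k(n, correct_counts[2], 64))
print(cover_at_tau(correct_counts, n, 0.5))  # 0.5
```

The estimator caps at 1.0 when fewer than k completions are incorrect, since any draw of k samples must then contain a correct one.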
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning boundaries beyond Pass@k metrics
Proposing Cover@tau to measure reliable problem-solving capability
Assessing model performance under explicit reliability thresholds
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLVR enhances reasoning with verifiable rewards
Cover@tau thresholds the proportion of correct completions per problem
New metric avoids misleading random guessing effects
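Why guessing inflates Pass@k but not Cover@tau can be seen with a toy "guesser" baseline. This sketch is an assumption-laden illustration, not from the paper: the guesser is correct on each completion independently with probability p = 0.05, as if guessing over a small discrete answer space.

```python
# Assumed per-completion success probability of a pure guesser.
p = 0.05

def guesser_pass_at_k(k, p=p):
    # Probability that at least one of k independent completions is correct:
    # grows toward 1 with the sampling budget, regardless of reasoning.
    return 1.0 - (1.0 - p) ** k

def guesser_cover_at_tau(tau, p=p):
    # With a large budget, the fraction of correct completions per problem
    # concentrates near p, so the guesser clears the threshold only
    # when tau <= p.
    return 1.0 if tau <= p else 0.0

print(guesser_pass_at_k(1))       # 0.05
print(guesser_pass_at_k(256))     # ~1.0: inflated by budget, not reasoning
print(guesser_cover_at_tau(0.5))  # 0.0: guessing cannot clear tau=0.5
```

The crossover from the abstract follows the same logic: at large k, even rarely-correct completions push Pass@k toward 1, while Cover@tau at any nontrivial tau filters them out.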