🤖 AI Summary
Large language models (LLMs) operate under an implicit "must-answer" assumption at inference time, which leads to overconfident, erroneous responses when the model is uncertain. Method: This paper proposes Selective QA, a paradigm in which a model responds only when its confidence exceeds a dynamically determined threshold and abstains otherwise. We combine test-time scaling with response selectivity, designing a confidence-modeling mechanism based on reasoning paths. The approach uses multi-sampling strategies, including self-verification and beam-search reranking, to enable risk-aware, dynamic response decisions. Contribution/Results: We introduce a non-zero-risk response evaluation framework that breaks the traditional zero-risk (forced-response) constraint. On the Selective QA benchmark, the method improves both F1 and Risk-Aware Accuracy and substantially tightens confidence–correctness calibration, advancing trustworthy LLM inference through calibrated, selective answering grounded in uncertainty estimation.
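The sampling-based selective-answering idea can be sketched minimally: sample several reasoning paths, treat agreement among their final answers as a confidence proxy, and abstain when that proxy falls below a threshold. This is an illustrative sketch, not the paper's exact mechanism; the function name, the majority-vote confidence proxy, and the threshold value are assumptions.

```python
from collections import Counter

def selective_answer(samples, threshold=0.6):
    """Return (answer, confidence), abstaining with answer=None.

    samples: final answers from independently sampled reasoning paths.
    The agreement fraction of the majority answer serves as a simple
    confidence proxy (a stand-in for the paper's reasoning-path-based
    confidence model).
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(samples)
    if confidence >= threshold:
        return answer, confidence
    return None, confidence  # abstain: confidence below threshold

# High agreement across samples -> answer; low agreement -> abstain.
ans, conf = selective_answer(["42", "42", "42", "17", "42"], threshold=0.6)
```

Raising the threshold trades coverage for precision: the model answers fewer questions but is wrong less often on those it does answer.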
📝 Abstract
Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
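The non-zero-risk evaluation setting described above can be illustrated with a simple scoring rule: a correct answer earns +1, a wrong answer costs a risk penalty, and an abstention scores 0. This scoring rule is an assumption for illustration; the paper's actual Risk-Aware Accuracy metric may be defined differently.

```python
def risk_aware_score(outcomes, risk=1.0):
    """Average score over all questions under a hypothetical risk setting.

    outcomes: one of "correct", "wrong", or "abstain" per question.
    risk: penalty for a wrong answer. At risk=0 this reduces to the
    fraction of questions answered correctly (the zero-risk setting);
    as risk grows, abstaining beats answering under uncertainty.
    """
    value = {"correct": 1.0, "wrong": -risk, "abstain": 0.0}
    return sum(value[o] for o in outcomes) / len(outcomes)

# Under risk=1.0, one wrong answer cancels one correct answer,
# so abstaining on low-confidence questions can raise the score.
score = risk_aware_score(["correct", "wrong", "abstain", "correct"], risk=1.0)
```

Reporting scores across several risk levels, as the abstract suggests, shows how a system's optimal answering threshold shifts as wrong answers become costlier.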