Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods

📅 2025-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically investigates how inference-time compute (ITC) enhances the reasoning capabilities of large language models (LLMs), focusing on the quality-efficiency trade-offs of ITC strategies for reasoning and non-reasoning models. We evaluate verifier-free ITC scaling methods, including majority voting, Best-of-N sampling, and sequential self-refinement, and construct the quality-efficiency Pareto frontier. We observe that correct responses from reasoning models tend to be shorter than incorrect ones, with fewer hedging and thinking markers but more discourse markers; leveraging these response features, we design a lightweight ITC filtering mechanism. Experiments show that reasoning models substantially outperform non-reasoning models, even when the latter receive a very high inference budget; majority voting is robust and efficient but exhibits diminishing marginal returns; and our feature-based filtering improves accuracy by up to 8.2%.
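The summary describes filtering sampled responses by surface features before aggregating them. A minimal sketch of that idea follows, assuming simple case-insensitive keyword matching: it ranks responses by their combined hedging and thinking marker count (shorter first on ties) and keeps the top fraction. The marker lists, the whitespace-token length proxy, and the `keep_fraction` parameter are hypothetical illustrations, not the authors' lexicons or filtering rule.

```python
import re
from typing import Dict, List, Sequence

# Hypothetical marker lexicons for illustration; not the word lists used in the paper.
HEDGING_MARKERS = ("maybe", "perhaps", "possibly", "might", "not sure")
THINKING_MARKERS = ("wait", "hmm", "let me think", "alternatively")
DISCOURSE_MARKERS = ("therefore", "thus", "so", "because", "first", "finally")


def count_markers(text: str, markers: Sequence[str]) -> int:
    """Count case-insensitive whole-word (or whole-phrase) occurrences of the markers."""
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(m) + r"\b", lowered)) for m in markers)


def response_features(text: str) -> Dict[str, int]:
    """Length and marker counts for one response."""
    return {
        "length": len(text.split()),  # whitespace tokens as a rough proxy for model tokens
        "hedging": count_markers(text, HEDGING_MARKERS),
        "thinking": count_markers(text, THINKING_MARKERS),
        "discourse": count_markers(text, DISCOURSE_MARKERS),
    }


def filter_responses(responses: Sequence[str], keep_fraction: float = 0.5) -> List[str]:
    """Keep the responses with the fewest hedging + thinking markers, shorter first on ties."""

    def sort_key(text: str):
        f = response_features(text)
        return (f["hedging"] + f["thinking"], f["length"])

    ranked = sorted(responses, key=sort_key)
    k = max(1, int(len(ranked) * keep_fraction))  # keep at least one response
    return ranked[:k]


# Toy example with made-up responses.
responses = [
    "First, compute 2+2. Therefore the answer is 4.",
    "Hmm, wait, maybe it is 5? Perhaps 4. Not sure.",
    "So the answer is 4 because 2+2=4.",
    "Wait, let me think. It might be 3.",
]
kept = filter_responses(responses)  # keeps the third and first responses
```

The surviving responses would then be aggregated, for example by majority voting over their extracted answers.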

📝 Abstract

There is intense interest in investigating how inference-time compute (ITC), e.g., repeated sampling and refinement, can improve large language model (LLM) capabilities. At the same time, recent breakthroughs in reasoning models such as DeepSeek-R1 open up the opportunity to use reinforcement learning to improve LLM reasoning skills. An in-depth understanding of how ITC interacts with reasoning across different models could provide important guidance on how to further advance the LLM frontier. This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. Specifically, we focus on verifier-free inference-time scaling methods because they generalize without needing a reward model. We construct the Pareto frontier of quality and efficiency. We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models. For reasoning models, majority voting proves to be a robust inference strategy, generally competitive with or outperforming more sophisticated ITC methods such as best-of-N and sequential revisions, while additional inference compute offers minimal improvements. We further perform in-depth analyses of how key response features (length and linguistic markers) are associated with response quality, and use these associations to improve existing ITC methods. We find that correct responses from reasoning models are typically shorter and contain fewer hedging and thinking markers (but more discourse markers) than incorrect responses.
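The quality-efficiency Pareto frontier keeps only the configurations that no other configuration beats on both cost and accuracy. A minimal sketch, assuming each (model, ITC method, budget) configuration has already been summarized as a single cost (for example, mean generated tokens per problem) and a single accuracy; the `Config` type and the example numbers are hypothetical, not results from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str        # label for a (model, ITC method, budget) combination
    cost: float      # inference compute, e.g. mean generated tokens per problem
    accuracy: float  # fraction of problems solved


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return configurations not dominated on (lower cost, higher accuracy), sorted by cost."""
    frontier: List[Config] = []
    best_acc = float("-inf")
    # Cheapest first; among equal costs, the most accurate comes first.
    for c in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        # A point is on the frontier only if it beats every cheaper-or-equal point.
        if c.accuracy > best_acc:
            frontier.append(c)
            best_acc = c.accuracy
    return frontier


# Made-up numbers for illustration only.
configs = [
    Config("A", 1000, 0.40),
    Config("B", 4000, 0.55),
    Config("C", 4000, 0.50),
    Config("D", 8000, 0.54),
    Config("E", 16000, 0.60),
]
frontier = pareto_frontier(configs)  # A, B, E: C and D are dominated by B
```

Sorting by cost and keeping each point that raises the best accuracy seen so far finds the frontier in O(n log n) time.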

Problem

Research questions and friction points this paper is trying to address.

Investigates how inference-time compute enhances LLM capabilities

Compares reasoning and non-reasoning models on challenging tasks

Analyzes verifier-free methods for efficiency and quality trade-offs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier-free inference-time scaling, which generalizes without a reward model

Majority voting as a robust inference strategy

Analysis of response features (length, linguistic markers) to improve ITC methods
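Majority voting needs no verifier: sample several responses, extract a final answer from each, and return the most common one. A minimal sketch, assuming answers have already been extracted as strings (`None` for unparsable samples); the `normalize` rule is a hypothetical placeholder for a task-specific answer matcher.

```python
from collections import Counter
from typing import Iterable, Optional


def normalize(answer: str) -> str:
    """Hypothetical light normalization so '4', ' 4 ' and '4.' vote together."""
    return answer.strip().rstrip(".").lower()


def majority_vote(answers: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent normalized answer; ties go to the answer seen first."""
    votes = Counter(normalize(a) for a in answers if a is not None)  # skip unparsable samples
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(majority_vote(["4", " 4.", "5", None, "5", "4"]))  # 4
```

`Counter.most_common` orders equal counts by first occurrence, so ties resolve to the earliest-sampled answer; other tie-breaking rules are equally defensible.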
