Enhancing Reasoning Accuracy in Large Language Models during Inference Time

📅 2026-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the unreliability of large language models in multi-step reasoning tasks, particularly without additional training. It presents a systematic evaluation of three inference-time strategies—self-consistency (based on stochastic decoding), dual-model reasoning consistency, and self-reflection—all integrated with chain-of-thought (CoT) prompting to generate intermediate reasoning steps. For the first time, these approaches are compared under a unified prompting and verification framework, enabling a controlled analysis of their respective strengths and performance boundaries. Experimental results demonstrate that self-consistency improves accuracy by 9%–15% in low-stakes settings, while the dual-model approach offers superior reliability in moderate-risk scenarios; self-reflection yields limited gains. This work provides empirical grounding and practical guidance for selecting reasoning strategies based on task requirements.

📝 Abstract
Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation of the three strategies under identical prompting and verification settings. Our experiments on an LLM [2] show that self-consistency with nucleus sampling and a controlled temperature yields the most substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding; this makes it well suited for low-risk domains, where it offers meaningful gains with minimal overhead. The dual-model approach provides additional confirmation of the model's reasoning steps and is thus more appropriate for moderate-risk domains, where higher reliability justifies the additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
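The first two strategies in the abstract lend themselves to a short sketch. The following is illustrative only, not the paper's code: `sample_answer` is a hypothetical stand-in for a stochastic, CoT-prompted LLM call (here simulated with a noisy solver), and the temperature / nucleus-sampling parameters are placeholders for the paper's controlled decoding settings.

```python
import random
from collections import Counter

def sample_answer(question, temperature=0.7, top_p=0.9, rng=random):
    """Hypothetical stand-in for one stochastic LLM call with CoT prompting.

    A real implementation would decode a chain-of-thought trace with the
    given temperature and nucleus-sampling settings; here we simulate a
    solver that returns the right answer 70% of the time.
    """
    return "42" if rng.random() < 0.7 else rng.choice(["41", "43"])

def self_consistency(question, n_samples=15, **decode_kwargs):
    """Strategy (i): sample several reasoning paths, then majority-vote
    the final answers. Returns the winning answer and its agreement rate."""
    answers = [sample_answer(question, **decode_kwargs)
               for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples

def dual_model_agreement(question, model_a, model_b):
    """Strategy (ii): trust an answer only when two independent models
    produce the same final answer; otherwise abstain (return None)."""
    a, b = model_a(question), model_b(question)
    return a if a == b else None

answer, agreement = self_consistency("What is 6 * 7?", n_samples=25,
                                     rng=random.Random(0))
print(f"majority answer: {answer} (agreement {agreement:.0%})")
```

The majority vote works because independent sampling errors tend to scatter across different wrong answers while correct reasoning paths converge on the same one; the agreement rate doubles as a rough confidence signal, which is one way to decide when the dual-model check is worth its extra compute.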
Problem

Research questions and friction points this paper is trying to address.

reasoning accuracy
large language models
multi-step reasoning
inference-time
Chain-of-Thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time reasoning
self-consistency
dual-model agreement
Chain-of-Thought prompting
large language models