🤖 AI Summary
This work proposes “prefix consistency,” a method that reduces the number of samples, and therefore tokens, that large language models need to reach a given accuracy on reasoning tasks under self-consistency. A chain-of-thought (CoT) trace is truncated partway through and its remainder is regenerated; traces with correct answers reproduce their original answer more often than traces with wrong answers. How often an answer reappears under regeneration serves as a reliability signal that weights each candidate in the majority vote, with no access to token log-probabilities and no self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches the plateau accuracy of standard majority voting with up to 21x fewer tokens (median 4.6x).
📝 Abstract
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log-probabilities or self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.
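The reweighting idea in the abstract can be illustrated with a minimal sketch. It is not taken from the linked repository. The function names, the use of a simple match fraction as the weight, and the toy data are all assumptions made for illustration; the paper's actual truncation point, number of regenerations, and weighting formula may differ. Each trace is represented as its original final answer together with the answers produced by regenerating the truncated remainder.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

# A trace is (original_answer, answers_from_regenerated_continuations).
Trace = Tuple[str, Sequence[str]]


def prefix_consistency(original: str, regenerated: Sequence[str]) -> float:
    """Fraction of regenerated continuations that reproduce the original answer.

    Assumption: a trace with no regenerations gets weight 0.0.
    """
    if not regenerated:
        return 0.0
    return sum(ans == original for ans in regenerated) / len(regenerated)


def majority_vote(traces: Sequence[Trace]) -> str:
    """Standard self-consistency: every trace contributes one vote."""
    counts: Dict[str, int] = defaultdict(int)
    for original, _ in traces:
        counts[original] += 1
    return max(counts, key=lambda a: counts[a])


def weighted_vote(traces: Sequence[Trace]) -> str:
    """Each trace votes for its answer with weight equal to its prefix consistency."""
    scores: Dict[str, float] = defaultdict(float)
    for original, regenerated in traces:
        scores[original] += prefix_consistency(original, regenerated)
    return max(scores, key=lambda a: scores[a])


# Toy data (hypothetical): the answer "7" is more frequent but unstable
# under regeneration, while "12" is rarer but reproduces itself.
traces = [
    ("12", ["12", "12", "12", "12"]),
    ("12", ["12", "12", "7", "12"]),
    ("7", ["7", "12", "9", "12"]),
    ("7", ["12", "7", "3", "12"]),
    ("7", ["9", "12", "7", "12"]),
]

print(majority_vote(traces))  # "7": three votes against two
print(weighted_vote(traces))  # "12": weight 1.75 against 0.75
```

In this toy case, plain majority voting picks the more frequent answer, while the consistency-weighted vote prefers the answer that survives regeneration. Both functions raise `ValueError` on an empty list of traces, and ties go to the answer seen first.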