🤖 AI Summary
Weighted majority voting methods such as CISC improve the reasoning accuracy of large language models, but they incur substantial computational overhead because a critic model must generate a confidence score for every candidate answer. This work proposes VecCISC, a novel, lightweight, and adaptive framework that combines clustering of reasoning trajectories with confidence-weighted self-consistency. By grouping reasoning paths by semantic similarity and filtering out redundant, degenerate, or hallucinated trajectories, VecCISC reduces the number of candidates the critic must evaluate. Experimental results across five diverse benchmarks show that VecCISC reduces token consumption by 47% on average while matching or surpassing the accuracy of the original CISC method.
📝 Abstract
A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g., Confidence-Informed Self-Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting requires calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter out reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. We evaluate VecCISC on five challenging, widely adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces total token usage by 47% while maintaining or exceeding the accuracy of CISC.
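To make the three ideas in the abstract concrete, the following is a minimal Python sketch of plain Self-Consistency, confidence-weighted voting, and a similarity-based pre-filter. It is an illustration under stated assumptions, not the authors' implementation: the function names, the greedy filtering strategy, the use of cosine similarity, and the `threshold` value of 0.95 are all hypothetical. The sketch also assumes that trace embeddings and critic confidences are supplied by the caller, and it only removes near-duplicate traces; it does not attempt to detect degenerate or hallucinated ones.

```python
import math
from collections import defaultdict


def majority_vote(answers):
    """Plain Self-Consistency: return the most frequent answer."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def weighted_vote(answers, confidences):
    """Weighted majority voting: sum the confidence per answer
    and return the answer with the largest accumulated score."""
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_near_duplicates(embeddings, threshold=0.95):
    """Hypothetical greedy filter: keep the index of a trace only if its
    embedding is below `threshold` similarity to every trace kept so far.
    Only the kept traces would then be sent to the critic."""
    kept = []
    for i, embedding in enumerate(embeddings):
        if all(cosine(embedding, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

For example, with answers `["A", "A", "B", "B", "B"]` and confidences `[0.9, 0.8, 0.3, 0.2, 0.4]`, `majority_vote` selects `"B"` (three votes against two), while `weighted_vote` selects `"A"` (accumulated score 1.7 against 0.9). In this sketch the saving comes from scoring only the indices returned by `filter_near_duplicates`, so the number of critic calls equals the number of kept traces rather than the number of sampled ones.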