🤖 AI Summary
To address the low token efficiency of self-consistency in chain-of-thought reasoning, this paper proposes ConfCov: an early-hypothesis pruning framework that preserves parallelism. ConfCov jointly models the model's internal confidence in individual hypotheses and the term-level coverage relationships among them, enabling dynamic identification and removal of redundant hypotheses via a lightweight weighted set cover algorithm. Crucially, pruning happens during generation while multiple reasoning paths continue to decode in parallel, so no sequential backtracking is required. Experiments with five large language models on three mathematical reasoning benchmarks show that ConfCov reduces token consumption by 23.7% on average (ranging from 10% to 35%), accelerates inference accordingly, and preserves accuracy. To our knowledge, this is the first work to jointly leverage confidence signals and lexical coverage for self-consistency pruning, achieving a favorable trade-off among efficiency, solution quality, and scalability.
📝 Abstract
Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate whether self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model's own confidence in individual hypotheses, and (b) how well candidate subsets of hypotheses, under consideration for continued retention, lexically cover the full current set. We design a fast weighted set cover algorithm that combines these two indicators; our evaluation of five LLMs on three math benchmarks shows that this method improves token efficiency for all models, by 10-35% in many cases.
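The pruning step described above can be sketched as a greedy weighted set cover: treat each hypothesis as a set of terms, score it by how many not-yet-covered terms it contributes weighted by the model's confidence, and keep only the hypotheses selected before coverage is complete. This is a minimal illustrative sketch, not the paper's exact algorithm; the function name `prune_hypotheses`, the term-set representation, and the specific scoring rule (new coverage times confidence) are assumptions for illustration.

```python
def prune_hypotheses(hypotheses, confidences):
    """Greedy weighted set cover over hypothesis term sets (illustrative sketch).

    hypotheses:  list of sets of terms (e.g. tokens from each intermediate
                 reasoning path); confidences: list of floats in (0, 1],
                 the model's confidence in each hypothesis.
    Returns the sorted indices of hypotheses to retain; the rest are pruned.
    """
    universe = set().union(*hypotheses) if hypotheses else set()
    covered, kept = set(), []
    remaining = set(range(len(hypotheses)))
    while covered != universe and remaining:
        # Greedy rule (assumed scoring): prefer the hypothesis that adds the
        # most new terms, scaled by confidence, so confident hypotheses that
        # cover unexplored lexical ground are kept first.
        best = max(remaining,
                   key=lambda i: len(hypotheses[i] - covered) * confidences[i])
        if not hypotheses[best] - covered:
            break  # no remaining hypothesis adds new coverage
        kept.append(best)
        covered |= hypotheses[best]
        remaining.remove(best)
    return sorted(kept)
```

With three hypotheses whose terms are `{x, y}`, `{y, z}`, and `{x, y, z}` and confidences 0.9, 0.8, 0.5, the greedy rule keeps the first two (they cover all terms at lower cost) and prunes the third, redundant one.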