Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models are prone to errors when generating complex reasoning such as mathematical proofs, and a verifier whose output feeds back to the prover faces substantial distribution shift, making it hard to ensure both soundness (catching errors in flawed proofs) and completeness (not rejecting correct proofs). This work proposes an online learning framework for training chain-of-thought verifiers, introducing extensions of the Littlestone dimension that tightly characterize the mistake bounds of verifier learning in the realizable setting. It gives optimal algorithms both for tracing the Pareto frontier between soundness and completeness mistakes and for minimizing a weighted combination of the two asymmetric costs. By combining the learned verifier with an ensemble of weak provers and an abstention mechanism, the method boosts the provers' overall accuracy and enables generation of proofs beyond their training distribution, with small error and abstention rates under a mild coverage assumption.

📝 Abstract
Large language models with chain-of-thought generation have demonstrated great potential for producing complex mathematical proofs. However, their reasoning can often go astray, leading to increasing interest in formal and learned verifiers. A major challenge in learning verifiers, especially when their output will be used by the prover, is that this feedback loop may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric roles of the verifier's soundness mistakes (failing to catch errors in a proof) and completeness mistakes (flagging correct proofs as wrong), we introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as for minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak provers, and to enable generation of proofs beyond what they were trained on. Under the mild assumption that one of the provers can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong prover with small error and abstention rates.
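To make the asymmetric-mistake setting concrete, here is a minimal sketch, not the paper's algorithm: realizable online learning of a binary verifier via a cost-weighted version-space vote over a finite hypothesis class. A soundness mistake accepts a flawed proof; a completeness mistake rejects a valid one. The hypothesis class, the cost weights, and the threshold-style hypotheses in the usage example are all illustrative assumptions.

```python
# Illustrative sketch (assumed names and costs, not the paper's method):
# online learning of a verifier in the realizable setting, with the two
# mistake types counted separately so a soundness budget can be enforced
# by raising w_sound relative to w_complete.

def weighted_halving(hypotheses, stream, w_sound=3.0, w_complete=1.0):
    """Predict by a cost-weighted vote of the version space, then shrink it.

    stream yields (proof, label) pairs; label 1 means the proof is correct.
    Returns (soundness_mistakes, completeness_mistakes).
    """
    version_space = list(hypotheses)
    soundness_mistakes = completeness_mistakes = 0
    for proof, label in stream:
        accept_votes = sum(h(proof) for h in version_space)
        reject_votes = len(version_space) - accept_votes
        # Accepting wrongly costs w_sound; rejecting wrongly costs w_complete.
        # The vote is therefore biased toward rejection when w_sound is larger.
        prediction = 1 if w_complete * accept_votes > w_sound * reject_votes else 0
        if prediction != label:
            if prediction == 1:
                soundness_mistakes += 1      # accepted a flawed proof
            else:
                completeness_mistakes += 1   # rejected a correct proof
        # Realizable setting: discard every hypothesis inconsistent with truth.
        version_space = [h for h in version_space if h(proof) == label]
    return soundness_mistakes, completeness_mistakes


# Usage with a toy threshold class: hypothesis t accepts proofs scored >= t.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(5)]
stream = [(0, 0), (4, 1), (1, 0), (3, 1), (2, 1)]  # labeled by threshold 2
s, c = weighted_halving(hypotheses, stream)
```

With the rejection-biased weights above, the learner trades completeness mistakes for soundness on this stream; sweeping `w_sound`/`w_complete` traces out points of exactly the kind of soundness-completeness trade-off the abstract's Pareto frontier formalizes.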
Problem

Research questions and friction points this paper is trying to address.

online learnability
chain-of-thought verifiers
soundness
completeness
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

online learning
chain-of-thought verification
soundness-completeness trade-off
Littlestone dimension
verifier-prover feedback loop