🤖 AI Summary
While large language models can produce highly accurate answers, those answers are often hard for weaker verification systems to check, and existing approaches pay for checkability with accuracy, a phenomenon known as the legibility tax. This work decouples correctness from checkability: a high-accuracy solver is trained first, and a separate translator then converts its solutions into a form that is both faithful to the original answer and amenable to verification. The authors formalize this translation task as a decoupled prover-verifier game whose equilibria correspond to faithful and checkable translators. Through this two-stage training procedure and an equilibrium analysis, the method keeps outputs accurate while making them significantly easier for weak verifiers to inspect, mitigating the legibility tax without compromising the solver's original performance.
📝 Abstract
As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve the checkability of model outputs, but they incur a degradation in accuracy compared to a baseline trained only to maximize correctness -- a phenomenon known as the legibility tax. We propose a solution that decouples the correctness condition from the checkability condition: we train a "translator" model that turns a fixed solver model's solutions into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to render the solver's solutions checkable while retaining the solver's answer. To accommodate this translation objective, we formulate a decoupled prover-verifier game whose equilibria correspond to faithful and checkable translators.
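The two-stage objective described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: all function names, the binary faithfulness check, and the additive reward combination are assumptions made for exposition; the solver's correctness reward is separate from, and trained before, the translator's reward.

```python
# Illustrative sketch of the decoupled two-stage objective (assumed
# form, not the paper's code): the solver is optimized purely for
# correctness in stage 1; in stage 2 the solver is frozen and the
# translator is rewarded for translations that a weak verifier
# accepts AND that preserve the solver's final answer.

def solver_reward(answer: str, ground_truth: str) -> float:
    """Stage 1: reward the solver only for correctness."""
    return 1.0 if answer == ground_truth else 0.0

def translator_reward(translation_answer: str,
                      solver_answer: str,
                      verifier_score: float,
                      faithfulness_weight: float = 1.0) -> float:
    """Stage 2 (solver frozen): reward the translator for
    (a) checkability, proxied here by the weak verifier's score, and
    (b) faithfulness, i.e. keeping the solver's answer unchanged."""
    faithful = 1.0 if translation_answer == solver_answer else 0.0
    return verifier_score + faithfulness_weight * faithful

# Toy usage: a faithful, checkable translation scores highest;
# changing the solver's answer forfeits the faithfulness term.
r_good = translator_reward("42", solver_answer="42", verifier_score=0.9)
r_unfaithful = translator_reward("41", solver_answer="42", verifier_score=0.9)
```

Here `r_good` exceeds `r_unfaithful`, so at equilibrium the translator has no incentive to alter the solver's answer to please the verifier, which is the decoupling the abstract describes.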