🤖 AI Summary
This work addresses a limitation of existing test-time alignment methods: their rejection criteria, such as confidence thresholds, are ad hoc and poorly motivated when the language itself is ambiguous. The paper shows that the implicit reward and nudging approaches both reduce to sampling from similar graphical models that differ only in their rejection criterion (or distribution), and it introduces a new rejection criterion based on a conservative confidence bet. Combining proxy-model guidance with this criterion, the method outperforms previous work on several datasets.
📝 Abstract
Recent work has proposed test-time alignment methods that rely on a small aligned model as a proxy to guide the generation of a larger base (unaligned) model. The implicit reward approach skews the large model's distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base model is not confident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our approach outperforms previous work on several datasets.
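
To make the confidence-based rejection rule described above concrete, here is a minimal sketch of nudging-style decoding for a single token. It illustrates only the prior confidence criterion, not the conservative confidence bet proposed in the paper, whose definition is not given in the abstract. The function name `nudged_next_token`, the greedy token choice, and the `threshold` value of 0.5 are assumptions made for illustration.

```python
def nudged_next_token(base_probs, aligned_probs, threshold=0.5):
    """Choose the next token with a confidence-threshold rejection rule.

    base_probs:    dict token -> probability from the large base model.
    aligned_probs: dict token -> probability from the small aligned model.
    threshold:     hypothetical confidence cutoff (not taken from the paper).

    Returns (token, source), where source is "base" or "aligned".
    """
    # Greedy proposal from the large base model.
    base_token = max(base_probs, key=base_probs.get)

    # Accept the proposal when the base model is confident enough.
    if base_probs[base_token] >= threshold:
        return base_token, "base"

    # Otherwise reject it and defer to the small aligned model.
    aligned_token = max(aligned_probs, key=aligned_probs.get)
    return aligned_token, "aligned"


# Confident base model: its token is kept.
confident = nudged_next_token({"yes": 0.9, "no": 0.1}, {"sure": 0.8, "no": 0.2})

# Probability mass split across two near-synonyms: the top probability
# falls below the threshold, so the rule defers to the aligned model.
split = nudged_next_token(
    {"big": 0.45, "large": 0.45, "small": 0.10},
    {"Sorry": 0.7, "big": 0.3},
)
```

The second call shows one possible reading of the abstract's objection: when several phrasings are equally acceptable, the maximum token probability is low even though the base model is not uncertain about what to say, so a plain threshold may reject the proposal unnecessarily.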