CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of controlling error rates when sampling multiple outputs from a large language model, in the setting where the answer set is unknown and the stopping rule is data-dependent. The paper proposes CITE, an algorithm that controls the false-certification rate at a prescribed level without prior knowledge of the answer categories and under arbitrary data-driven stopping rules. Built on E-processes and intersection–union hypothesis testing, CITE provides anytime-valid certification that a target answer is the unique mode of the response distribution. Its stopping-time rate does not depend on the size of the category set and matches minimax lower bounds up to constants in the main regime. Simulations and LLM self-consistency experiments show empirical error control and improved certification under long-tailed (diffuse-tail) output distributions.

📝 Abstract

Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when the stopping rule is data-dependent and the set of possible answers is not known in advance. We study anytime-valid certification of a prespecified target answer as the unique mode of the model's response distribution, a guarantee distinct from answer correctness. We propose the Certification by Intersection-union Testing with E-processes (CITE) algorithm, which provably controls false certification at any prescribed level under arbitrary data-driven stopping, without requiring prior knowledge of the answer category set. We also prove a category-set-size-free stopping-time rate, establish matching minimax lower bounds up to constants in the main regime, and extend the construction to confidence-weighted voting. Simulations and LLM self-consistency experiments show empirical error control and improved certification in diffuse-tail settings.
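
The abstract does not spell out the construction, so the following is only a minimal illustrative sketch of how intersection-union testing with e-processes could certify a mode. It is not the paper's CITE algorithm. It assumes i.i.d. samples, a fixed betting fraction `lam`, and a simple pairwise betting e-process for each null "the target is no more likely than rival b"; the function name `certify_mode` and all parameters are hypothetical. Every rival that has not yet appeared shares a single e-value, which is how the sketch avoids needing the answer set in advance.

```python
def certify_mode(samples, target, alpha=0.05, lam=0.5):
    """Illustrative sketch (not the paper's CITE algorithm).

    Returns the 1-based index of the sample at which `target` is certified
    as the unique mode at level `alpha`, or None if that never happens.
    Assumes i.i.d. samples and a fixed betting fraction 0 < lam < 1.
    """
    threshold = 1.0 / alpha
    # rival answer -> e-value for the pairwise null p_target <= p_rival
    wealth = {}
    # e-value shared by every rival that has not been observed so far
    unseen_wealth = 1.0

    for t, answer in enumerate(samples, start=1):
        if answer == target:
            # A target draw counts as a win against every rival, seen or unseen.
            unseen_wealth *= 1.0 + lam
            for rival in wealth:
                wealth[rival] *= 1.0 + lam
        else:
            # On first appearance a rival inherits the unseen e-value,
            # because all earlier draws in {target, rival} were target draws.
            if answer not in wealth:
                wealth[answer] = unseen_wealth
            # A rival draw is a loss only in that rival's pairwise test.
            wealth[answer] *= 1.0 - lam

        # Intersection-union rule: certify only if every pairwise test rejects,
        # i.e. the smallest e-value has crossed 1 / alpha.
        if min([unseen_wealth, *wealth.values()]) >= threshold:
            return t
    return None


# Usage: a stream that always returns the target versus one that alternates.
unanimous = certify_mode(["A"] * 50, target="A")
contested = certify_mode(["A", "B"] * 20, target="A")
```

Under each pairwise null, the target wins a draw restricted to that pair with probability at most 1/2, so each factor has conditional mean at most 1 and the pairwise e-value is a nonnegative supermartingale. Taking the minimum over rivals keeps the statistic below the e-value of any rival that truly ties or beats the target, which is what makes stopping at a data-dependent time safe in this sketch.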

Problem

Research questions and friction points this paper is trying to address.

anytime-valid inference

LLM self-consistency

error control

data-dependent stopping

answer certification

Innovation

Methods, ideas, or system contributions that make the work stand out.

anytime-valid inference

self-consistency

E-processes