AI Summary
This work identifies a critical yet previously unexamined issue in test-time sampling for language models: multiple reasoning trajectories often converge into a few stable but potentially incorrect "reasoning basins," causing majority voting to be dominated by erroneous answers. To address this, the authors propose ARBITER, a conservative evidence-augmentation framework that requires no external information and leverages only samples generated by the same model to improve answer selection by modeling interactions among reasoning basins. ARBITER includes two variants: a parameter-free version, ARBITER-ฮ, which refines majority priors using intra-model evidence, and ARBITER-Enc, which incorporates residual hidden states from complete solutions. Evaluated on benchmarks such as GSM8K and MMLU-HS-Math with models including Qwen3-4B and Llama-3.1-8B, ARBITER consistently yields performance gains; for example, it boosts Llama-3.1-8B on MMLU-HS-Math from 78.5% to 82.5%, recovering approximately 22% of the oracle performance gap without any observed degradation.
Abstract
When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-ฮ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples reaches accuracy in the mid-94% range, while a same-pool top-2 oracle reaches the mid-96% range. ARBITER recovers a subset of the wrong-majority cases that separate consensus from this oracle, using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
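The quantities the abstract refers to (basins, the majority vote, the top-2 oracle, and recovered headroom) can be made concrete with a minimal Python sketch. It is illustrative only and is not the authors' implementation. The function names, the normalization rule, and the reading of "top-2 oracle" as "the gold answer is among the two largest basins" are all assumptions introduced here, and the sample pool and accuracy figures in the usage note are invented for illustration.

```python
from collections import Counter


def normalize_answer(ans: str) -> str:
    # Hypothetical normalization: trim whitespace, lowercase, drop thousands
    # separators and a trailing period. The paper's actual rule is not given.
    return ans.strip().lower().replace(",", "").rstrip(".")


def basins(final_answers):
    """Group sampled final answers into basins keyed by normalized answer."""
    return Counter(normalize_answer(a) for a in final_answers)


def majority_vote(final_answers):
    """Return the answer of the largest basin (plain consensus)."""
    return basins(final_answers).most_common(1)[0][0]


def top2_oracle_correct(final_answers, gold):
    """Assumed oracle: correct if the gold answer is one of the two largest basins."""
    top2 = [answer for answer, _ in basins(final_answers).most_common(2)]
    return normalize_answer(gold) in top2


def headroom_recovered(baseline, improved, oracle):
    """Fraction of the baseline-to-oracle gap closed by the improved method."""
    return (improved - baseline) / (oracle - baseline)
```

As a usage example, take a made-up pool of K=24 samples in which 14 end in "42", 8 in "48", and 2 in "40", with gold answer "48". `majority_vote` returns "42", a wrong-majority failure, while `top2_oracle_correct` returns `True` because the correct basin is the second largest. That gap between consensus and the oracle is the headroom the abstract describes; with hypothetical accuracies of 70, 75, and 90 for baseline, improved, and oracle, `headroom_recovered` gives 0.25.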