MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses numerical non-determinism in large language models during zero-temperature BF16 batched inference, where the same request can produce different outputs when decoded alone versus inside a batch. The authors observe that such token flips are sparse and occur predominantly at decoding steps with small top-1/top-2 logit margins. Building on this, they propose a sparse verification mechanism that triggers verification only at these low-margin steps and repairs confirmed mismatches by replacing the current K/V column in the cache. Evaluated on Llama-3.1-8B and Qwen2.5-14B, the method achieves 100% sequence-level determinism with verifier trigger rates below 19%, reducing the latency increment of always-on verification by roughly 2x (2.23x/1.99x), with the policy calibrated on one dataset and transferred to others.

📝 Abstract

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on 0.48% of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, ShareGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.
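The gating policy described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `threshold` parameter, and the `verify_fn` callback (a stand-in for whatever reference computation the verifier runs) are all hypothetical, and the repair here only swaps the emitted token, whereas the paper's repair also replaces the current K/V column.

```python
import numpy as np

def top2_margin(logits):
    """Top-1 minus top-2 logit for each row of a (batch, vocab) array."""
    # After partitioning at index -2, the last two entries of each row
    # are the second-largest and the largest logit, in that order.
    top2 = np.partition(logits, -2, axis=-1)[..., -2:]
    return top2[..., 1] - top2[..., 0]

def gated_step(fast_logits, verify_fn, threshold):
    """One decode step: greedy tokens, verified only on low-margin rows.

    fast_logits: (batch, vocab) logits from the fast batched path.
    verify_fn:   hypothetical callback, row index -> reference token id.
    threshold:   hypothetical margin cutoff; rows below it are verified.
    """
    tokens = fast_logits.argmax(axis=-1)
    risky = np.flatnonzero(top2_margin(fast_logits) < threshold)
    repaired = []
    for row in risky:
        ref_token = int(verify_fn(row))
        if ref_token != tokens[row]:
            # Confirmed mismatch: keep the reference token. A real system
            # would also overwrite this step's K/V entries for the row.
            tokens[row] = ref_token
            repaired.append(int(row))
    return tokens, risky.tolist(), repaired
```

With a well-chosen threshold, high-margin rows never call `verify_fn`, so the verifier cost scales with the trigger rate rather than with the number of decode steps.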

Problem

Research questions and friction points this paper is trying to address.

batch-invariant inference

token flip

deterministic decoding

LLM inference

reproducibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

MarginGate

batch-invariant inference

sparse verification