🤖 AI Summary
Subword tokenization introduces a bias in language model evaluation: standard practice scores only a single tokenization of the text, whereas an ideal evaluation would marginalize over all valid subword segmentations. Existing sampling-based approximations require a costly LLM generation step for every sample, which severely limits how many samples can be drawn and therefore the accuracy of the estimate. This paper proposes decoding-free approximate marginalization: instead of invoking the LLM to generate, it samples tokenizations directly from the space of valid subword segmentations using cheap strategies that are model- and tokenizer-agnostic, reserving the LLM for inexpensive forward scoring only. Evaluated across multiple open-source language models, the decoding-free samplers deliver stable marginal probability estimates at a small fraction of the runtime cost of generation-based sampling, substantially reducing inference cost, and the paper demonstrates the approach on a set of downstream inference tasks.
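To make the contrast concrete, one standard way to write the quantities involved is shown below; the notation is ours and the importance-sampled estimator is a generic construction, not necessarily the exact estimator the paper uses. The single-path score uses only the tokenizer's canonical segmentation, the marginal sums over every segmentation that detokenizes to the text, and a cheap proposal q over segmentations (requiring no LLM decoding) turns that sum into a Monte Carlo estimate.

```latex
% Notation is ours, not necessarily the paper's.
% x: the text; \mathcal{T}(x): all token sequences that detokenize to x;
% \mathbf{t}^{*}(x): the tokenizer's canonical segmentation; q: a proposal over segmentations.
\begin{align*}
  \text{single-path score:} \quad & P_\theta\big(\mathbf{t}^{*}(x)\big) \\
  \text{marginal score:}    \quad & P_\theta(x) = \sum_{\mathbf{t} \in \mathcal{T}(x)} P_\theta(\mathbf{t}) \\
  \text{sampled estimate:}  \quad & \widehat{P}_\theta(x) = \frac{1}{N} \sum_{i=1}^{N}
      \frac{P_\theta\big(\mathbf{t}^{(i)}\big)}{q\big(\mathbf{t}^{(i)} \mid x\big)},
      \qquad \mathbf{t}^{(i)} \sim q(\cdot \mid x)
\end{align*}
```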
📝 Abstract
Modern language models operate on subword-tokenized text in order to trade off model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization: the total probability mass of all tokenizations of a given text.
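To illustrate why a single string admits many tokenizations, the toy sketch below enumerates every segmentation of a short word over a made-up vocabulary (the vocabulary, word, and helper function are ours, purely for illustration; real BPE or Unigram vocabularies are orders of magnitude larger, which is exactly what makes exact marginalization intractable).

```python
# Illustration only: enumerate every way to segment a string over a toy subword
# vocabulary, showing that one text has many valid tokenizations.

def all_tokenizations(text, vocab):
    """Return every split of `text` into pieces that are all members of `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_tokenizations(text[end:], vocab):
                results.append([piece] + rest)
    return results

# Toy vocabulary: a few multi-character pieces plus every single character.
vocab = {"un", "lock", "able", "unlock"} | set("unlockable")
segmentations = all_tokenizations("unlockable", vocab)
print(len(segmentations))   # several valid segmentations, not just one
# ['unlock', 'able'] and ['un', 'lock', 'able'] are among them, alongside
# character-level splits; a model assigns a different probability to each.
```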
Marginalization is difficult due to the sheer number of possible tokenizations of a text, so it is often approximated via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired within a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given its tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free: they require no generation from the LLM and instead rely entirely on extremely cheap sampling procedures that are model- and tokenizer-agnostic.
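As a rough sketch of what a decoding-free strategy can look like (this is a generic importance-sampling construction under our own assumptions, not the specific strategies studied in the paper; `lm_logprob` is a hypothetical stand-in for a single teacher-forced forward pass, and every single character is assumed to be in the vocabulary so that a segmentation always exists), consider sampling a segmentation using only vocabulary lookups and then scoring it with the model:

```python
# Sketch of decoding-free marginal estimation, not the paper's algorithm:
# (1) sample alternative tokenizations WITHOUT any LLM decoding, then
# (2) score each sample with a single teacher-forced forward pass.
import math
import random

def sample_segmentation(text, vocab, rng):
    """Sample one valid segmentation left-to-right, choosing uniformly among the
    vocabulary pieces that match the remaining text. Returns the pieces and
    log q(pieces | text), the log-probability under this proposal.
    Assumes every single character is in `vocab`, so a match always exists."""
    pieces, log_q = [], 0.0
    while text:
        options = [text[:end] for end in range(1, len(text) + 1) if text[:end] in vocab]
        piece = rng.choice(options)          # vocabulary lookups only; no model calls
        log_q -= math.log(len(options))
        pieces.append(piece)
        text = text[len(piece):]
    return pieces, log_q

def estimate_marginal_logprob(text, vocab, lm_logprob, n_samples=64, seed=0):
    """Importance-sampled estimate of log P(text), marginalizing over tokenizations.
    `lm_logprob(pieces)` is a hypothetical helper: one forward pass returning the
    model's log-probability of that fixed token sequence."""
    rng = random.Random(seed)
    log_weights = []
    for _ in range(n_samples):
        pieces, log_q = sample_segmentation(text, vocab, rng)
        log_weights.append(lm_logprob(pieces) - log_q)   # log [ P(t) / q(t) ]
    # log of the Monte Carlo average, computed stably: logsumexp - log(N)
    m = max(log_weights)
    return m + math.log(sum(math.exp(w - m) for w in log_weights)) - math.log(n_samples)
```

The proposal itself touches only strings and vocabulary membership tests, so drawing thousands of candidate tokenizations costs far less than a single autoregressive generation; the LLM is used once per sample, solely to score a fixed token sequence.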
We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models, finding that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost, and we demonstrate their use on a set of downstream inference tasks.