Foundations of Top-$k$ Decoding for Language Models

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of rigorous theoretical foundations for top-k decoding. We establish the first unified theoretical framework by modeling each decoding step as a sparse probability distribution recovery problem, formulated as separable Bregman divergence minimization with ℓ₀ regularization—thereby unifying diverse sampling strategies under a single optimization principle. We prove, for the first time from a Bregman divergence primal-dual perspective, that top-k constitutes the optimal truncation solution under KL divergence. We further propose a semantic-aware nonlinear reweighting decoding paradigm and leverage discrete convexity together with binary search to compute the optimal k in polynomial time. Our results not only provide a rigorous justification for top-k’s empirical success but also yield a novel family of decoding strategies and reveal how distinct Bregman divergences induce fundamentally different reweighting mechanisms.
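As a rough, self-contained illustration of the recovery problem described above (not the paper's exact formulation), the sketch below truncates a next-token distribution to its k largest entries, re-normalizes, and scores the result with a KL term plus an ℓ₀-style penalty on the support size; the penalty weight `lam` and the exact form of the objective are assumptions made for illustration.

```python
import numpy as np

def topk_renormalize(p, k):
    """Keep the k largest probabilities of p, zero out the rest, and re-normalize to sum to 1."""
    idx = np.argsort(p)[::-1][:k]
    q = np.zeros_like(p)
    q[idx] = p[idx]
    return q / q.sum()

def truncation_loss(p, k, lam=0.05):
    """Illustrative objective: KL(q || p) of the truncated distribution plus lam * k (an l0-style penalty)."""
    q = topk_renormalize(p, k)
    support = q > 0
    kl = np.sum(q[support] * np.log(q[support] / p[support]))
    return kl + lam * k

# Toy next-token distribution (sorted here only for readability).
p = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])
losses = [truncation_loss(p, k) for k in range(1, len(p) + 1)]
print(losses.index(min(losses)) + 1)  # best k for this toy example (5 with lam=0.05)
```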

📝 Abstract
Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider \emph{Bregman decoders} obtained by minimizing a separable Bregman divergence (for both the \emph{primal} and \emph{dual} cases) with a sparsity-inducing $\ell_0$ regularization. Despite the combinatorial nature of the objective, we show how to optimize it efficiently for a large class of divergences. We show that the optimal decoding strategies are greedy, and further that the loss function is discretely convex in $k$, so that binary search provably and efficiently finds the optimal $k$. We show that top-$k$ decoding arises as a special case for the KL divergence, and identify new decoding strategies that have distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
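To illustrate the abstract's claim that discrete convexity of the loss in k lets binary search find the optimal k, here is a generic sketch; the function name and interface are assumptions for illustration, not the paper's algorithm or API.

```python
def argmin_discretely_convex(loss, k_min, k_max):
    """Binary search for the minimizer of a discretely convex (decreasing-then-increasing)
    loss over the integers k_min..k_max, using O(log(k_max - k_min)) loss evaluations."""
    lo, hi = k_min, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if loss(mid) <= loss(mid + 1):
            hi = mid        # the minimizer is at mid or to its left
        else:
            lo = mid + 1    # the minimizer lies strictly to the right of mid
    return lo

# Toy usage with a discretely convex loss whose minimum is at k = 7.
print(argmin_discretely_convex(lambda k: (k - 7) ** 2, 1, 50_000))  # -> 7
```

The abstract also mentions decoders that non-linearly up-weight larger probabilities after re-normalization. A power transform is one simple way to realize such up-weighting; the exponent `alpha` below is an assumption for illustration, not the specific reweighting the paper derives from a given Bregman divergence.

```python
import numpy as np

def power_reweighted_topk(p, k, alpha=2.0):
    """Keep the k largest probabilities, raise them to the power alpha (alpha > 1 up-weights
    larger entries relative to smaller ones), and re-normalize to a valid distribution."""
    idx = np.argsort(p)[::-1][:k]
    w = p[idx] ** alpha
    q = np.zeros_like(p)
    q[idx] = w / w.sum()
    return q
```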
Problem

Research questions and friction points this paper is trying to address.

Lack of theoretical foundation for top-k decoding in LLMs
Need to generalize top-k decoding via Bregman divergence framework
Efficient optimization of sparse probability distribution recovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified theoretical framework that explains and generalizes top-k decoding
Efficient optimization of ℓ₀-regularized Bregman divergence objectives
Greedy optimal strategies and binary search for the optimal k
🔎 Similar Papers
No similar papers found.