Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge large language models face in balancing diversity, creativity, and logical consistency during open-ended generation. Existing truncation-based sampling methods overlook the semantic geometric structure of the token embedding space. To overcome this limitation, the authors propose Top-W decoding, which introduces the Wasserstein distance into truncation sampling for the first time. By explicitly modeling the geometry of the embedding space, Top-W balances retained probability mass against the entropy of the kept token set. Theoretically, the optimal truncation set admits a closed-form solution, either a singleton or a one-dimensional prefix, and an efficient linear-scan algorithm is devised accordingly. The method remains compatible with standard sampling interfaces and combines Wasserstein regularization, embedding-based geometric potentials, and an alternating decoding mechanism. Experiments on GSM8K, GPQA, AlpacaEval, and MT-Bench demonstrate that Top-W outperforms state-of-the-art approaches by up to 33.7%, significantly enhancing both accuracy and creativity in generated text.

📝 Abstract
Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring the semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses a Wasserstein distance, defined over token-embedding geometry, to keep the truncated distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal truncation set either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to a 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance but also boosts creativity under judge-based open-ended evaluation.
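The abstract's linear-scan structure (sort tokens by probability, then choose between a singleton and a probability-ordered prefix under a mass-entropy-geometry trade-off) can be illustrated with a minimal sketch. This is not the paper's actual objective: the function name `top_w_sketch`, the trade-off weights `lam` and `beta`, and the crude mean-distance-to-top-token stand-in for the Wasserstein/geometric potential are all illustrative assumptions.

```python
import numpy as np

def top_w_sketch(probs, embeddings, lam=0.5, beta=0.1):
    """Hypothetical sketch of geometry-aware prefix truncation.

    probs      : (V,) next-token probabilities
    embeddings : (V, d) token embeddings
    lam, beta  : illustrative trade-off weights (not from the paper)
    """
    order = np.argsort(-probs)   # scan candidates in probability order
    p = probs[order]
    e = embeddings[order]

    best_score, best_k = -np.inf, 1
    for k in range(1, len(p) + 1):
        kept = p[:k]
        mass = kept.sum()                 # retained probability mass
        q = kept / mass                   # renormalized kept distribution
        entropy = -(q * np.log(q + 1e-12)).sum()
        # crude geometric potential: mean embedding distance of kept
        # tokens to the top-1 token (stand-in for a Wasserstein term)
        geo = np.linalg.norm(e[:k] - e[0], axis=1).mean()
        score = mass - lam * entropy - beta * geo
        if score > best_score:            # best prefix found by linear scan
            best_score, best_k = score, k

    kept_ids = order[:best_k]             # singleton when best_k == 1
    q = probs[kept_ids] / probs[kept_ids].sum()
    return kept_ids, q
```

In a real decoder this rule would slot into the standard truncation-then-sample interface: compute `kept_ids, q` from the logits' softmax, then sample the next token from `q`. With a sharply peaked distribution and large `lam`, the scan stops at the singleton, matching the dichotomy the theory describes.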
Problem

Research questions and friction points this paper is trying to address.

large language models
truncation sampling
semantic geometry
Wasserstein distance
open-ended generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wasserstein distance
geometry-aware decoding
truncation sampling
token embedding geometry
mass-entropy trade-off