🤖 AI Summary
This work addresses a critical limitation of existing sampling-based decoding methods: they rely solely on token probabilities while neglecting the geometric relationships among tokens in the embedding space. This oversight leads to an "embedding-space crowding" phenomenon that degrades performance on complex reasoning tasks. The study is the first to formally identify, quantify, and correlate this crowding effect with reduced success rates in mathematical reasoning. To mitigate it, the authors propose CraEG, a plug-and-play, geometry-guided sampling method that reweights sampling probabilities based on the intrinsic geometric structure of the embedding space. CraEG requires no additional training and operates with only a single forward pass, yet consistently improves both generation quality and diversity. Extensive experiments across multiple models and benchmarks demonstrate consistent gains in reasoning accuracy, robustness, and output diversity.
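The summary does not specify how crowding is measured, but one natural proxy, sketched below as a purely illustrative assumption (the function name, top-k restriction, and formula are not from the paper), is the probability-weighted mean pairwise cosine similarity among the embeddings of the highest-probability candidate tokens: the score approaches 1 when probability mass sits on geometrically close tokens and falls toward 0 when mass spreads over dissimilar directions.

```python
import math

def crowding_score(probs, embeddings, k=10):
    """Hypothetical crowding proxy: probability-weighted mean pairwise cosine
    similarity among the top-k candidate tokens' embeddings. Higher values mean
    the next-token distribution concentrates on geometrically close tokens.
    This is an illustrative sketch, not the paper's actual metric."""
    # Indices of the k most probable tokens.
    top = sorted(range(len(probs)), key=lambda i: probs[i])[-k:]
    z = sum(probs[i] for i in top)
    p = [probs[i] / z for i in top]  # renormalized top-k probabilities

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    num = den = 0.0
    for a in range(k):
        for b in range(k):
            if a == b:
                continue  # exclude self-similarity
            w = p[a] * p[b]  # joint probability weight for the token pair
            num += w * cos(embeddings[top[a]], embeddings[top[b]])
            den += w
    return num / den
```

For instance, if the two most probable tokens share an embedding direction the score is 1.0, whereas two orthogonal top tokens give 0.0.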
📝 Abstract
Sampling-based decoding underlies complex reasoning in large language models (LLMs), where decoding strategies critically shape model behavior. Temperature- and truncation-based methods reshape the next-token distribution through global probability reweighting or thresholding to balance the quality-diversity tradeoff. However, they operate solely on token probabilities, ignoring fine-grained relationships among tokens in the embedding space. We uncover a novel phenomenon, embedding-space crowding, where the next-token distribution concentrates its probability mass on geometrically close tokens in the embedding space. We quantify crowding at multiple granularities and find a statistical association with reasoning success in mathematical problem solving. Motivated by this finding, we propose CraEG, a plug-and-play sampling method that mitigates crowding through geometry-guided reweighting. CraEG is training-free, single-pass, and compatible with standard sampling strategies. Experiments on multiple models and benchmarks demonstrate improved generation performance, with gains in robustness and diversity metrics.
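The abstract describes geometry-guided reweighting only at a high level; the sketch below is one plausible instantiation under stated assumptions (the exponential penalty, the `alpha` strength parameter, and the top-k restriction are all hypothetical, not CraEG's published formulation). Each candidate token's probability is down-weighted by its probability-weighted similarity to the other candidates, shifting mass away from crowded regions of the embedding space while remaining a single-pass, training-free step that composes with temperature or truncation sampling.

```python
import math

def geometry_reweight(probs, embeddings, alpha=1.0, k=10):
    """Illustrative geometry-guided reweighting (not the paper's exact method):
    p'_i ∝ p_i * exp(-alpha * crowd_i), where crowd_i is token i's
    probability-weighted cosine similarity to the other top-k candidates.
    Returns a dict {token_id: reweighted probability} over the top-k tokens."""
    top = sorted(range(len(probs)), key=lambda i: probs[i])[-k:]
    z = sum(probs[i] for i in top)

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    # Per-token crowding: similarity to the other candidates, weighted by
    # their renormalized probabilities.
    crowd = [
        sum((probs[j] / z) * cos(embeddings[i], embeddings[j]) for j in top if j != i)
        for i in top
    ]
    # Penalize crowded tokens, then renormalize into a distribution.
    w = [probs[i] * math.exp(-alpha * c) for i, c in zip(top, crowd)]
    s = sum(w)
    return {i: wi / s for i, wi in zip(top, w)}
```

With three candidates where two share an embedding direction and the third is orthogonal, the orthogonal token's probability rises after reweighting and the duplicated pair's falls, which is the intended anti-crowding effect.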