Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of evaluating generation quality efficiently and accurately during test-time scaling of large language models, a task often hindered by reliance on external reward models or noisy intrinsic signals. The authors propose a supervision-free response selection mechanism that exploits the tendency of high-entropy tokens to cluster into consecutive groups during inference. Specifically, they formalize these clusters as High Entropy Phases (HEPs) and define the “entropy centroid” as the weighted average position of all HEPs along the trajectory; a lower centroid indicates early exploration followed by confident generation. By selecting the response with the lowest entropy centroid, the method consistently outperforms existing baselines on mathematical reasoning, code generation, logical reasoning, and agentic tasks, with stable gains across model sizes ranging from 14B to 480B parameters.

📝 Abstract

An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computational overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy when aggregated naively. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust-nlp/entropy-centroid.
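
To make the pipeline described in the abstract concrete, here is a minimal Python sketch of HEP detection, the centroid computation, and Lowest Centroid selection. It is an illustration under stated assumptions, not the authors' implementation (see the repository linked above for that). The abstract does not specify the entropy threshold, how many consecutive low-entropy tokens close a phase, how each HEP is weighted, or how positions are normalized. The names `find_heps`, `entropy_centroid`, `lowest_centroid`, `high_thresh`, and `patience` are hypothetical, and the choices of summed entropy as the weight and the normalized phase midpoint as the position are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple


def find_heps(entropies: Sequence[float],
              high_thresh: float = 1.0,
              patience: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of High Entropy Phases.

    Assumption: a phase opens at a token with entropy >= high_thresh and
    closes once `patience` consecutive tokens fall below the threshold.
    The closing low-entropy run is not counted as part of the phase.
    """
    phases: List[Tuple[int, int]] = []
    start = None
    low_run = 0
    for i, h in enumerate(entropies):
        if start is None:
            if h >= high_thresh:
                start, low_run = i, 0
        elif h < high_thresh:
            low_run += 1
            if low_run >= patience:
                phases.append((start, i - patience + 1))
                start, low_run = None, 0
        else:
            low_run = 0
    if start is not None:
        # Sequence ended inside a phase; drop any trailing low-entropy tokens.
        phases.append((start, len(entropies) - low_run))
    return phases


def entropy_centroid(entropies: Sequence[float],
                     high_thresh: float = 1.0,
                     patience: int = 3) -> float:
    """Weighted average position of all HEPs, normalized to [0, 1].

    Assumptions: each phase is weighted by its summed token entropy and
    positioned at its midpoint; a trajectory with no HEP scores 0.0.
    """
    n = len(entropies)
    total_mass = 0.0
    weighted_pos = 0.0
    for s, e in find_heps(entropies, high_thresh, patience):
        mass = sum(entropies[s:e])
        midpoint = (s + e - 1) / 2
        total_mass += mass
        weighted_pos += mass * midpoint / max(n - 1, 1)
    return weighted_pos / total_mass if total_mass > 0 else 0.0


def lowest_centroid(candidates: Sequence[Sequence[float]],
                    high_thresh: float = 1.0,
                    patience: int = 3) -> int:
    """Return the index of the candidate with the lowest entropy centroid."""
    scores = [entropy_centroid(c, high_thresh, patience) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)


# Toy per-token entropies: one response is uncertain early, the other late.
early = [2.0, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
late = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 2.0, 2.0]
best = lowest_centroid([late, early])  # picks the early-exploration response
```

In this toy example the response whose uncertainty is concentrated at the start has the lower centroid and is selected, matching the intuition that early exploration followed by confident generation tends to indicate higher quality. In practice the per-token entropies would come from the model's next-token distributions during sampling.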

Problem

Research questions and friction points this paper is trying to address.

intrinsic reward

test-time scaling

model uncertainty

entropy

response selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy Centroid

High Entropy Phase

intrinsic reward