🤖 AI Summary
This paper addresses the degradation of model generalization and robustness caused by excessively low entropy in output distributions. It proposes Entropy-Regularized Activation (ERA), a novel paradigm that introduces a differentiable, parameterized output activation function to explicitly enforce a lower bound on the sampling entropy of predicted distributions. By elevating the activation function to an entropy-control module, ERA achieves consistent performance gains across diverse domains with minimal overhead (under 7% additional compute). Specifically, it improves Qwen2.5-Math-7B’s score on AIME 2025 by 37.4%, surpasses strong baselines such as SAC by more than 30% in policy performance on HumanoidBench, and raises ResNet-50’s top-1 accuracy on ImageNet by 0.69%. Crucially, ERA requires no architectural modifications or changes to training objectives, offering a task-agnostic, low-overhead, and broadly applicable entropy-aware optimization framework for continuous control, large language model inference, and image classification.
📝 Abstract
We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models (LLMs), it boosts the AIME 2025 score of Qwen2.5-Math-7B by 37.4%; 2) for continuous-control reinforcement learning agents, it improves performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, it raises ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
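To make the core idea concrete, here is a minimal illustrative sketch of an entropy-floor output transformation for a categorical distribution. This is *not* the paper's actual ERA activation (which is differentiable and parameterized); it simply demonstrates the concept of mapping raw logits to a distribution whose entropy is guaranteed to stay above a chosen threshold, here by bisecting on a softmax temperature. The function name and the bisection approach are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def entropy_floor_activation(logits, h_min, iters=60):
    """Map logits to a distribution with entropy >= h_min.

    Illustrative only: if the plain softmax already satisfies the
    floor, return it; otherwise bisect on a temperature T (softmax
    entropy increases monotonically in T) until the floor is met.
    Requires h_min < log(len(logits)), the maximum possible entropy.
    """
    p = softmax(logits)
    if entropy(p) >= h_min:
        return p
    lo, hi = 1.0, 1e6  # entropy(lo) < h_min, entropy(hi) ~ log K >= h_min
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits / mid)) < h_min:
            lo = mid
        else:
            hi = mid
    return softmax(logits / hi)  # hi always satisfies the floor
```

For example, sharply peaked logits such as `[10, 0, 0]` yield a near-deterministic softmax with entropy close to zero; passing them through `entropy_floor_activation(logits, 0.5)` flattens the distribution just enough to satisfy the 0.5-nat floor, while already-diffuse logits pass through unchanged.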