🤖 AI Summary
Existing Transformer-based hyperparameter optimization (HPO) methods rely heavily on large-scale historical trajectory datasets and lack effective reinforcement learning mechanisms, resulting in poor cold-start capability and unstable training. To address these limitations, the authors propose GRPOformer, a framework that introduces Group Relative Policy Optimization (GRPO) to HPO, enabling optimization-strategy learning from scratch. A Transformer models historical optimization trajectories and generates new hyperparameter configurations, while GRPO rapidly constructs trajectories and updates the policy; a Policy Churn Regularization (PCR) term further stabilizes GRPO training. Evaluated on OpenML multi-task benchmarks, GRPOformer consistently outperforms baseline methods across diverse tasks, demonstrating strong generalization and data efficiency, particularly in low-data and cold-start regimes.
📝 Abstract
Hyperparameter optimization (HPO) plays a critical role in improving model performance. Transformer-based HPO methods have shown great potential; however, existing approaches rely heavily on large-scale historical optimization trajectories and lack effective reinforcement learning (RL) techniques, limiting their efficiency and performance. Inspired by the success of Group Relative Policy Optimization (GRPO) in large language models (LLMs), we propose GRPOformer -- a novel hyperparameter optimization framework that integrates RL with Transformers. In GRPOformer, a Transformer generates new hyperparameter configurations from historical optimization trajectories, while GRPO enables rapid trajectory construction and optimization-strategy learning from scratch. Moreover, we introduce Policy Churn Regularization (PCR) to enhance the stability of GRPO training. Experimental results on OpenML demonstrate that GRPOformer consistently outperforms baseline methods across diverse tasks, offering new insights into the application of RL to HPO.
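Neither the summary nor the abstract spells out the GRPO update, but its core idea, scoring each sampled candidate against its own sampling group rather than against a learned critic, can be sketched as follows. The function name, the group size, and the use of a population standard deviation are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Sketch of GRPO's group-relative advantage (an assumption, not the
    paper's exact implementation): each sampled hyperparameter
    configuration's reward is normalized by the mean and standard
    deviation of its sampling group, so no value-function critic is
    needed to form the policy-gradient signal."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: validation scores of four configurations sampled in one group.
# Advantages are zero-mean, so above-average configs get positive credit.
advs = group_relative_advantages([0.71, 0.74, 0.69, 0.78])
```

In an HPO setting like the one described here, each reward would be the validation performance of a configuration proposed by the Transformer policy, and these normalized advantages would weight the policy update.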