🤖 AI Summary
This work proposes Frost Training, a method for improving Monte Carlo-based policy optimization on Cross-Entropy Games, a large family of tasks in which a large language model serves as the judge. The method exploits the gradient of the reward function in embedding space, the same signal used by the Greedy Coordinate Gradient (GCG) jailbreaking technique, and the authors report the first demonstration that this signal can also boost model training. Frost Training is validated with GRPO training on maximum-likelihood infilling (fill-in-the-blank) tasks, where it improves the model's ability to generate high-scoring outputs: it reaches higher maximum scores in a best-of-k setting, and reaches them faster.
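For readers unfamiliar with the two evaluation ingredients named here, the following is a minimal sketch of GRPO's group-relative advantage (rewards standardized within a group of samples drawn for the same prompt) and of a best-of-k score. The function names, the sample rewards, and the choice of population standard deviation with a small `eps` are illustrative assumptions, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def best_of_k(rewards, k):
    """Highest score among the first k sampled completions."""
    return max(rewards[:k])


# Hypothetical judge scores (e.g. log-likelihoods) for four sampled completions.
rewards = [-3.2, -1.5, -2.7, -0.9]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive a positive advantage and are reinforced; best-of-k simply reports the top score among k samples, which is the quantity the summary says improves.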
📝 Abstract
We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training. We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so faster.
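The embedding-space signal can be illustrated with a toy sketch of the GCG-style step: take the gradient of the reward with respect to the current token's embedding, use it to rank candidate token swaps by their first-order predicted reward change, then evaluate a short list exactly. Everything below is a hypothetical stand-in (a four-token vocabulary, 2-D embeddings, a quadratic reward with a closed-form gradient). The abstract does not say how Frost Training feeds this signal into GRPO, so this shows only the signal itself, not the authors' method.

```python
# Toy illustration only: the embeddings, target, and reward are hypothetical.
EMBEDDINGS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per token
TARGET = [0.9, 0.8]  # the toy reward peaks at this point in embedding space


def reward(e):
    """Toy reward: negative squared distance to TARGET."""
    return -sum((x - t) ** 2 for x, t in zip(e, TARGET))


def reward_grad(e):
    """Closed-form gradient of the toy reward with respect to the embedding."""
    return [-2.0 * (x - t) for x, t in zip(e, TARGET)]


def first_order_scores(current_token):
    """Predicted reward change for swapping in each token: (E[v] - e_cur) . grad."""
    e_cur = EMBEDDINGS[current_token]
    g = reward_grad(e_cur)
    return [
        sum((x - c) * gi for x, c, gi in zip(e_v, e_cur, g))
        for e_v in EMBEDDINGS
    ]


def propose_swap(current_token, top_k=2):
    """Shortlist the top_k tokens by gradient score, then pick the best by exact reward."""
    scores = first_order_scores(current_token)
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    shortlist = ranked[:top_k]
    return max(shortlist, key=lambda v: reward(EMBEDDINGS[v]))


best_token = propose_swap(current_token=0)
```

With a real language model the gradient would come from backpropagation through the judge rather than a closed form. The point of the shortlist is that the gradient is only a linear approximation, so candidates are re-scored exactly before one is chosen.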