Training-Free Group Relative Policy Optimization

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) generalize poorly in specialized domains such as mathematical reasoning and web search, and struggle to efficiently integrate external tools and prompting strategies. Method: This paper proposes Training-Free Group Relative Policy Optimization (Training-Free GRPO), a framework that replaces parameter updates with a token prior distilled from experiential knowledge. Instead of scalar rewards, it computes group relative semantic advantages within each group of rollouts and iteratively distills high-quality experiential knowledge over multiple epochs. Its core techniques include group-wise relative semantic analysis, experiential knowledge distillation, and lightweight token-prior modeling, with the prior injected dynamically at API-call time. Results: With only a few dozen real-world examples and zero fine-tuning, Training-Free GRPO significantly improves the out-of-domain performance of DeepSeek-V3.1-Terminus, surpassing small LLMs fine-tuned on comparably limited data, at high efficiency and low computational cost.
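The core learning step can be pictured as follows. This is a minimal sketch, assuming a hypothetical `llm(prompt) -> str` completion helper; the function names, prompt wording, and group size are illustrative, not the paper's actual interface.

```python
# Minimal sketch of one Training-Free GRPO step. `llm` is a hypothetical
# prompt-in, text-out helper; names and prompts are illustrative only.

def rollout(llm, problem: str, experiences: list[str], n: int = 4) -> list[str]:
    """Sample a group of n candidate solutions, conditioning each one on
    the current experiential knowledge injected as a token prior."""
    prior = "Learned experience:\n" + "\n".join(f"- {e}" for e in experiences)
    prompt = f"{prior}\n\nProblem: {problem}\nSolve step by step."
    return [llm(prompt) for _ in range(n)]

def semantic_advantage(llm, problem: str, group: list[str], answer: str) -> str:
    """Replace the scalar group-relative advantage with a natural-language
    one: ask the model to compare rollouts within the group against the
    ground truth and distill what the better ones did right."""
    joined = "\n\n".join(f"[Rollout {i}]\n{r}" for i, r in enumerate(group))
    return llm(
        f"Problem: {problem}\nGround-truth answer: {answer}\n\n{joined}\n\n"
        "Compare the rollouts above. In one or two sentences, state what "
        "the better rollouts did that the worse ones missed."
    )
```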

📝 Abstract
Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. Methods such as agentic reinforcement learning have been proposed to address this, but they typically rely on costly parameter updates, for example through Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. We argue instead that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages a group relative semantic advantage, instead of a numerical one, within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. This knowledge serves as the learned token prior and is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web-searching tasks demonstrate that Training-Free GRPO, applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs at marginal training data and cost.
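Because the learned knowledge is plain text, deploying it requires nothing beyond prompt assembly. Below is a sketch of the injection step at inference time, assuming an OpenAI-compatible endpoint; the model name and system-prompt wording are placeholders, not the paper's exact setup.

```python
# Sketch: injecting the distilled experiential knowledge as a token prior
# at API-call time. No parameters of the served model are updated; the
# model name and prompt wording below are placeholders.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works the same way

def answer_with_prior(problem: str, experiences: list[str]) -> str:
    prior = "\n".join(f"- {e}" for e in experiences)
    response = client.chat.completions.create(
        model="deepseek-v3.1-terminus",  # placeholder model identifier
        messages=[
            {"role": "system",
             "content": "Apply the following learned experience:\n" + prior},
            {"role": "user", "content": problem},
        ],
    )
    return response.choices[0].message.content
```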
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM agent performance without costly parameter updates
Addressing performance degradation in specialized real-world domains
Overcoming practical data scarcity and overfitting in LLM optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free GRPO enhances LLM agents without any parameter updates
Group relative semantic advantages drive experiential knowledge distillation
Learned token prior guides the model at API-call time (see the sketch below)
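Putting these pieces together, the outer loop might look like the sketch below, reusing the `rollout` and `semantic_advantage` helpers from the first sketch. The append-only merge is an assumption; how the paper consolidates the experience library across epochs is not specified here.

```python
# Sketch of the multi-epoch learning loop (structure assumed). Each epoch,
# every training problem yields a group of rollouts, a semantic comparison,
# and a distilled lesson merged into the experience library.

def training_free_grpo(llm, dataset, epochs: int = 3) -> list[str]:
    """dataset: list of (problem, ground_truth_answer) pairs,
    typically only a few dozen samples."""
    experiences: list[str] = []
    for _ in range(epochs):
        for problem, answer in dataset:
            group = rollout(llm, problem, experiences)
            lesson = semantic_advantage(llm, problem, group, answer)
            experiences.append(lesson)  # append-only merge; an assumption
    return experiences
```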