🤖 AI Summary
This work addresses the end-to-end optimization challenge for multi-module language model programs, i.e., systems composed of multiple LM calls and external tools. We propose mmGRPO, the first extension of Group Relative Policy Optimization (GRPO) to modular systems; it groups policy-gradient updates per module and supports variable-length and truncated trajectories. Integrated with automated prompt optimization, mmGRPO is implemented as the `dspy.GRPO` optimizer within the DSPy framework. Its core innovation lies in joint cross-module policy learning coupled with co-optimization of prompts and execution logic. Evaluated on classification, multi-hop search, and privacy-preserving delegation tasks, mmGRPO achieves an average accuracy improvement of 11% over post-training baselines and 5% over prompt-only optimization, significantly enhancing the trainability and generalization of complex, modular AI systems.
📝 Abstract
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the `dspy.GRPO` optimizer.
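The abstract's central idea, grouping LM calls by module across rollouts and computing group-relative advantages, can be sketched as follows. This is a hedged illustration of the concept, not the `dspy.GRPO` implementation: the trajectory data structure, the function name `mm_group_advantages`, and the per-module normalization shown here are all assumptions for exposition.

```python
from collections import defaultdict
from statistics import mean, pstdev

def mm_group_advantages(rollouts):
    """Illustrative sketch of the multi-module grouping idea.

    `rollouts` is assumed to be a list of (steps, reward) pairs, where
    `steps` is a list of (module_name, completion) LM calls and `reward`
    is the scalar reward of the whole trajectory. Trajectories may have
    different lengths (e.g., a variable number of search hops).
    """
    # Group every LM call by its module, pooling across all rollouts.
    groups = defaultdict(list)  # module -> [(completion, reward), ...]
    for steps, reward in rollouts:
        for module, completion in steps:
            groups[module].append((completion, reward))

    # GRPO-style group-relative advantage: each call is credited with its
    # trajectory's reward, mean-centered and std-scaled within the group
    # of calls made by the same module.
    advantages = {}
    for module, calls in groups.items():
        rewards = [r for _, r in calls]
        mu = mean(rewards)
        sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
        advantages[module] = [(c, (r - mu) / sigma) for c, r in calls]
    return advantages
```

Under these assumptions, a two-module program (say, `search` and `answer`) with trajectories of different lengths still yields one well-defined advantage group per module, which is what makes a grouped policy-gradient update per module possible.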