🤖 AI Summary
This work addresses the end-to-end optimization challenge for multi-module language model programs, i.e., systems composed of multiple LM calls and external tools. We propose mmGRPO, the first extension of Group Relative Policy Optimization (GRPO) to modular systems; it groups policy-gradient updates per module and supports variable-length and truncated trajectories. Integrated with automated prompt optimization, mmGRPO is implemented as the `dspy.GRPO` optimizer within the DSPy framework. Its core innovation lies in joint cross-module policy learning coupled with co-optimization of prompts and execution logic. Evaluated on classification, multi-hop search, and privacy-preserving delegation tasks, mmGRPO achieves an average accuracy improvement of 11% over post-training baselines and 5% over prompt-only optimization, significantly enhancing the trainability and generalization of complex, modular AI systems.
📝 Abstract
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the `dspy.GRPO` optimizer.
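The abstract's central idea, grouping LM calls by module across rollouts and computing group-relative advantages, can be sketched as follows. This is a hedged illustration of the concept, not the `dspy.GRPO` implementation: the trajectory data structure, the function name `mm_group_advantages`, and the per-module normalization shown here are all assumptions for exposition.

```python
from collections import defaultdict
from statistics import mean, pstdev

def mm_group_advantages(rollouts):
    """Illustrative sketch of the multi-module grouping idea.

    `rollouts` is assumed to be a list of (steps, reward) pairs, where
    `steps` is a list of (module_name, completion) LM calls and `reward`
    is the scalar reward of the whole trajectory. Trajectories may have
    different lengths (e.g., a variable number of search hops).
    """
    # Group every LM call by its module, pooling across all rollouts.
    groups = defaultdict(list)  # module -> [(completion, reward), ...]
    for steps, reward in rollouts:
        for module, completion in steps:
            groups[module].append((completion, reward))

    # GRPO-style group-relative advantage: each call is credited with its
    # trajectory's reward, mean-centered and std-scaled within the group
    # of calls made by the same module.
    advantages = {}
    for module, calls in groups.items():
        rewards = [r for _, r in calls]
        mu = mean(rewards)
        sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
        advantages[module] = [(c, (r - mu) / sigma) for c, r in calls]
    return advantages
```

Under these assumptions, a two-module program (say, `search` and `answer`) with trajectories of different lengths still yields one well-defined advantage group per module, which is what makes a grouped policy-gradient update per module possible.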