Generalizable End-to-End Tool-Use RL with Synthetic CodeGym

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing tool-augmented LLM agents rely on static trajectory supervision or narrow-domain reinforcement learning (RL), resulting in poor generalization and robustness when encountering novel tools or unseen workflows. Method: We propose CodeGym, a scalable framework that synthesizes multi-turn interactive RL environments from programming problems. It automatically transforms static coding tasks into diverse, verifiable tool-call environments by parsing code and extracting atomic functions or logic into executable tool interfaces. Through end-to-end RL training, CodeGym enables agents of varying sizes and chain-of-thought configurations to learn adaptive tool orchestration across complex, multi-step tasks. Contribution/Results: CodeGym-trained models generalize consistently across tasks and tool compositions. On the out-of-distribution (OOD) benchmark τ-Bench, Qwen2.5-32B-Instruct achieves an 8.7-percentage-point accuracy gain, demonstrating both the effectiveness and broad applicability of the framework.

📝 Abstract
Tool-augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine-tuning (SFT) over static trajectories or reinforcement learning (RL) on narrow tasks, and generalize poorly beyond development settings, leading to brittleness with new tools and unseen workflows. Because code execution reflects many structures of real-world workflows, coding problems provide a natural basis for building agent training environments. Motivated by this, we introduce CodeGym, a scalable framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL, enabling LLM agents to explore and master various workflows actively. CodeGym rewrites static coding problems into interactive environments by extracting atomic functions or logic into callable tools, yielding verifiable tasks that span various tool-execution workflows. Models of varying sizes and chain-of-thought configurations, trained in CodeGym, exhibit consistent out-of-distribution generalizability; for example, Qwen2.5-32B-Instruct achieves an absolute accuracy gain of 8.7 points on the OOD benchmark $\tau$-Bench. These results highlight CodeGym as a step toward scalable general-purpose RL environments that align with real-world agent workflows.
Problem

Research questions and friction points this paper is trying to address.

Improving generalization of tool-augmented LLMs beyond narrow training tasks
Addressing brittleness of LLM agents with new tools and unseen workflows
Creating scalable RL environments that align with real-world agent workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizes interactive tool-use environments from coding problems
Rewrites static code into verifiable multi-turn workflows
Enables reinforcement learning for generalizable tool execution
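The core transformation described above can be sketched in miniature: a static coding problem's atomic logic is extracted into callable tools, and the original reference solution supplies a verifiable reward signal for RL. The paper does not publish this API; all names below (`ToolEnv`, `register`, `step`, `reward`) are hypothetical illustrations of the idea, not CodeGym's actual interface.

```python
# Illustrative sketch (hypothetical API, not the paper's implementation):
# a static coding task becomes a multi-turn environment where the agent
# acts only through registered tools, and correctness is checked against
# the original reference solution.
from typing import Any, Callable, Dict


class ToolEnv:
    """Multi-turn tool-call environment derived from a static coding problem."""

    def __init__(self, reference: Callable[..., Any]):
        self.tools: Dict[str, Callable[..., Any]] = {}
        self.reference = reference  # ground-truth solution used for verification

    def register(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        # Expose an extracted atomic function as a callable tool.
        self.tools[fn.__name__] = fn
        return fn

    def step(self, tool_name: str, *args: Any) -> Any:
        # One agent turn: execute the requested tool, return the observation.
        return self.tools[tool_name](*args)

    def reward(self, answer: Any, *task_args: Any) -> float:
        # Verifiable reward: 1.0 iff the answer matches the reference solution.
        return 1.0 if answer == self.reference(*task_args) else 0.0


# Static problem: "return the maximum element of a list".
env = ToolEnv(reference=max)


@env.register
def compare(a: int, b: int) -> int:
    """Atomic logic extracted as a tool: return the larger of two numbers."""
    return a if a > b else b


# A scripted stand-in for an agent trajectory: chain tool calls, then answer.
nums = [3, 7, 2]
best = nums[0]
for x in nums[1:]:
    best = env.step("compare", best, x)

print(env.reward(best, nums))
```

Because the reward is computed from the original problem's reference solution, every synthesized environment is automatically verifiable, which is what makes large-scale RL over these tasks feasible.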