Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

📅 2026-02-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for coding agents, such as SWE-Bench, mostly evaluate the resolution of single GitHub issues, which limits how well they reflect the diverse and complex tasks agents face in real use. This work decomposes task trajectories into fine-grained components to identify skills that transfer across tasks, and derives principles for designing auxiliary training tasks that teach those skills. Following these principles, the authors build Hybrid-Gym, a training environment of scalable synthetic tasks such as function localization and dependency search. Agents trained on these tasks generalize to real-world tasks absent from training, improving a base model by absolute gains of 25.4% on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite.

📝 Abstract

Predominant benchmarks for assessing coding agents, such as SWE-Bench, focus on solving single GitHub issues. In real use, however, these agents handle a wider variety of more complex tasks that involve other skills, such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and derive a set of principles for designing auxiliary training tasks that teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks generalize effectively to diverse real-world tasks that are not present in training, improving a base model by an absolute 25.4% on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code available at: https://github.com/yiqingxyq/Hybrid-Gym.
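The abstract names function localization and dependency search as example synthetic tasks but does not describe how they are constructed. As a rough illustration only, the sketch below shows how ground-truth answers for two such tasks could be derived from Python source using the standard-library ast module. The helper names locate_functions and direct_dependencies, and the EXAMPLE snippet, are hypothetical and are not taken from the Hybrid-Gym codebase.

```python
import ast

# Hypothetical toy "repository" used only for this illustration.
EXAMPLE = '''\
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def run(path):
    return parse(load(path))
'''

FUNC_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef)


def locate_functions(source: str) -> dict[str, tuple[int, int]]:
    """Map each function name in `source` to its (start_line, end_line).

    This could serve as the reference answer for a localization task:
    "where is function X defined?"
    """
    spans = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, FUNC_TYPES):
            spans[node.name] = (node.lineno, node.end_lineno)
    return spans


def direct_dependencies(source: str, func_name: str) -> set[str]:
    """Return functions defined in `source` that `func_name` calls directly.

    Only plain-name calls such as `parse(...)` are considered; attribute
    calls such as `obj.method(...)` are ignored in this simplified sketch.
    """
    tree = ast.parse(source)
    defined = {n.name for n in ast.walk(tree) if isinstance(n, FUNC_TYPES)}
    for node in ast.walk(tree):
        if isinstance(node, FUNC_TYPES) and node.name == func_name:
            called = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
            # Keep only locally defined callees and drop self-recursion.
            return (called & defined) - {func_name}
    raise KeyError(func_name)


if __name__ == "__main__":
    print(locate_functions(EXAMPLE))
    print(direct_dependencies(EXAMPLE, "run"))
```

Because the answers are computed directly from the syntax tree, tasks of this kind can be generated for any parseable file without human labeling, which is one plausible reading of "scalable" in the abstract. The actual task construction in Hybrid-Gym may differ.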

Problem

Research questions and friction points this paper is trying to address.

coding agents

task generalization

transferable skills

benchmark limitations

real-world tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid-Gym

coding agents

transferable skills