AI Summary
Current agent research is constrained by simplistic software environments and short-horizon tasks, and lacks scalable methods for constructing complex, realistic settings. This work proposes Gym-Anything, a framework that automatically transforms arbitrary software into interactive agent environments. It formulates environment construction as a multi-agent collaborative task, integrating code generation, vision-language models (VLMs), real-world data configuration, trajectory distillation, and an automated auditing mechanism. Leveraging a GDP-grounded occupational classification, the authors introduce CUA-World, a benchmark comprising over 10,000 long-horizon tasks, with CUA-World-Long episodes exceeding 500 steps. A distilled 2B-parameter VLM outperforms models twice its size, and the auditing feedback loop improves Gemini-3-Flash's success rate on CUA-World-Long from 11.5% to 14.0%.
Abstract
Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a narrow set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies this evidence against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark whose tasks often require over 500 steps, far exceeding existing benchmarks. A 2B vision-language model trained by distilling successful trajectories from the training split outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.