ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

2026-04-20
Citations: 0
Influential: 0

AI Summary
Manually constructing training and evaluation environments for claw-like agents is costly and difficult to scale. To address this, the work proposes ClawEnvKit, an automated pipeline that generates and validates executable environments on demand from natural language specifications. The framework comprises three core modules: a parser, an environment generator, and a multidimensional validator. Using this pipeline, the authors build Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 task categories; it matches or exceeds human-curated environments in coherence and clarity at 13,800x lower construction cost. Evaluated across 4 model families and 8 agent harness frameworks, harness engineering improves performance by up to 15.7 percentage points over a bare ReAct baseline.


πŸ“ Abstract
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
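The three-module flow described in the abstract (parse, generate, validate) can be illustrated with a toy sketch. This is not the ClawEnvKit API: every name here (`GenerationParams`, `Environment`, `parse_request`, `generate`, `validate`, `build_environments`) is hypothetical, the parser accepts only a fixed "<count> <category> environments" phrasing instead of free-form natural language, and the validator checks only structural validity, scoring consistency, and a crude duplicate-based diversity rule. Feasibility checking is omitted because the abstract does not say how it is done.

```python
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Structured parameters extracted from the request (hypothetical schema)."""
    category: str
    count: int


@dataclass
class Environment:
    """One generated environment: task, tool interface, scoring config."""
    task: str
    tools: list[str]
    scoring: dict[str, float]


def parse_request(text: str) -> GenerationParams:
    """Toy parser: expects '<count> <category> environments'."""
    words = text.split()
    return GenerationParams(category=words[1], count=int(words[0]))


def generate(params: GenerationParams) -> list[Environment]:
    """Toy generator: emits one placeholder environment per requested slot."""
    return [
        Environment(
            task=f"{params.category} task #{i}",
            tools=[f"{params.category}_tool"],
            scoring={"completion": 1.0},
        )
        for i in range(params.count)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    """Keep environments that pass the simplified checks assumed here."""
    seen_tasks: set[str] = set()
    kept: list[Environment] = []
    for env in envs:
        # Structural validity: all three components must be present.
        structurally_valid = bool(env.task and env.tools and env.scoring)
        # Internal consistency (assumption): scoring weights sum to 1.
        weights_consistent = abs(sum(env.scoring.values()) - 1.0) < 1e-9
        # Diversity (assumption): reject exact duplicate task descriptions.
        is_new = env.task not in seen_tasks
        if structurally_valid and weights_consistent and is_new:
            seen_tasks.add(env.task)
            kept.append(env)
    return kept


def build_environments(text: str) -> list[Environment]:
    """Chain the three stages: parse, generate, validate."""
    return validate(generate(parse_request(text)))
```

Calling `build_environments("3 email environments")` returns three placeholder environments that pass the checks. In the real system each stage would presumably be far richer; the sketch only shows how a validator can sit between generation and use, so that nothing unverified reaches the benchmark.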
Problem

Research questions and friction points this paper is trying to address.

claw-like agents
environment generation
automated pipeline
benchmarking
training environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

automatic environment generation
claw-like agents
natural language to environment
validated benchmark
on-demand evaluation