🤖 AI Summary
Existing agent benchmarks rarely isolate direct use of Unix system primitives and often conflate it with general programming skill. This work gives an operational definition of Unix competence and introduces unix-ctf, a procedural generator of CTF-style tasks: each task hides a flag inside a fresh Linux container using a single Unix feature, and the agent must recover it using only shell interactions. Tasks are synthesized by an LLM-assisted pipeline that produces parameterized hide-and-find script pairs and filters them with a bidirectional contract, while the container, layout, and grading harness stay fixed. The pipeline yields 656 validated task variants from 750 raw attempts (87.5%), covering 155 distinct techniques. Fine-tuning Qwen3-8B with LoRA using GRPO raises its solve rate on a 15-skill holdout set (n=225) from 11.6% to 43.6% and produces a 33-percentage-point gain in Forensics, with 32/100 tasks solved on InterCode-CTF.
📝 Abstract
Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver fluent in Python but weak in Unix can pass a substantial fraction of Terminal-Bench 2.0, while the reverse skill profile is rarely exercised. We make the distinction operational and build a training surface for the Unix component. unix-ctf is a procedural generator of capture-the-flag tasks for shell agents. Each task hides a short token (a flag of the form flag(a3b1c9...)) inside a fresh Linux container using a single Unix feature, and the agent must recover it. Tasks are produced by an LLM-assisted synthesis pipeline that generates candidate hiding techniques, rewrites them into parameterized hide-and-find script pairs, and filters them with a bidirectional contract: the hide script must leave no plaintext trace of the flag on disk, and the find script must recover the flag in a fresh directory. Because the LLM only writes the planting and recovery steps (the container, layout, and grading harness are fixed), the pipeline lands 656 of 750 raw attempts as portable, reusable variants (87.5%). Our reproduction of Endless Terminals' full-container-generation approach lands only 17.4% under the same checks. The 656 variants canonicalize to 155 distinct techniques. Fine-tuning Qwen3-8B with LoRA using GRPO on this surface lifts solve rate from 11.6% to 43.6% on a 15-skill multi-family holdout (n=225), redistributes which InterCode-CTF tasks the model solves, and produces a +33 pp gain in Forensics while reaching 32/100 on InterCode-CTF. These results suggest that Unix competence is separable, trainable, and best evaluated directly rather than folded into programming-through-a-shell.
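The bidirectional contract described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`hide`, `find`, `leaves_plaintext`, `check_contract`), the choice of file modification times as the hiding technique, and the reading of "fresh directory" as a second, newly planted directory are all assumptions made for illustration. Temporary directories stand in for containers.

```python
import os
import tempfile
from pathlib import Path


def hide(flag: str, root: Path) -> None:
    """Hypothetical hide step: store each flag byte as the mtime of an empty file."""
    for i, byte in enumerate(flag.encode()):
        path = root / f"{i:04d}"
        path.touch()
        # Set atime and mtime to the byte value, in seconds since the epoch.
        os.utime(path, (byte, byte))


def find(root: Path) -> str:
    """Hypothetical find step: read the mtimes back in filename order."""
    data = bytes(int(p.stat().st_mtime) for p in sorted(root.iterdir()))
    return data.decode()


def leaves_plaintext(flag: str, root: Path) -> bool:
    """Return True if the flag appears verbatim in any file name or file body."""
    needle = flag.encode()
    for path in root.rglob("*"):
        if needle in path.name.encode():
            return True
        if path.is_file() and needle in path.read_bytes():
            return True
    return False


def check_contract(hide_fn, find_fn, flag: str) -> bool:
    """Assumed form of the contract: no plaintext trace, and recovery succeeds."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        # Direction 1: planting must not leave the flag readable on disk.
        hide_fn(flag, Path(a))
        if leaves_plaintext(flag, Path(a)):
            return False
        # Direction 2: the find step must recover the flag from a fresh plant.
        hide_fn(flag, Path(b))
        return find_fn(Path(b)) == flag
```

Calling `check_contract(hide, find, "flag(deadbeef)")` with a made-up token accepts this pair, while a hide step that simply writes the token into a text file is rejected by the plaintext scan. A real filter would presumably also scan for encodings that are trivially reversible, which this sketch does not attempt.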