AI Summary
This work addresses the challenge that large language model agents, with frozen weights, struggle to learn through interaction in complex multi-turn environments. To overcome this limitation, the authors propose workspace optimization, an approach that shifts the training paradigm from weight space to a structured external workspace. The method substitutes artifacts, evidence, counterexamples, and textual feedback for parameters, data, losses, and gradients, respectively, thereby emulating a training mechanism without modifying model weights. It is implemented in DreamTeam, a multi-agent harness whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. Evaluated on the 25-game ARC-AGI-3 public set and averaged over two independent runs, DreamTeam improves the score of the state-of-the-art protocol-matched agent from 36% to 38.4% while using 31% fewer environment actions per game.
Abstract
Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's workspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.
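The analogy in the abstract (artifacts for parameters, evidence for data, counterexamples for losses, textual feedback for gradients) can be illustrated with a toy loop. The sketch below is not DreamTeam's implementation. Every name in it (`Workspace`, `revise`, `optimize_workspace`, `hidden_env`, the candidate pool) is hypothetical, the environment is a one-line arithmetic rule rather than an ARC-AGI-3 game, and the revision step picks from a fixed list of candidate rules where a real agent would presumably ask a language model to rewrite the artifact from the feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Transition = Tuple[int, int]  # (state, observed next state)
Candidate = Tuple[str, Callable[[int], int]]


@dataclass
class Workspace:
    # Artifact: the current executable world-model hypothesis ("parameters").
    artifact_name: str
    artifact: Callable[[int], int]
    # Evidence: transitions observed from the environment ("data").
    evidence: List[Transition] = field(default_factory=list)
    # Textual feedback accumulated across revisions ("gradients").
    feedback_log: List[str] = field(default_factory=list)


def counterexamples(ws: Workspace) -> List[Transition]:
    """Evidence the artifact fails to reproduce (plays the role of a loss)."""
    return [(s, nxt) for s, nxt in ws.evidence if ws.artifact(s) != nxt]


def textual_feedback(ws: Workspace, failures: List[Transition]) -> List[str]:
    """Describe each failure in words, as input to the revision step."""
    return [
        f"{ws.artifact_name} predicted {ws.artifact(s)} for state {s}, observed {nxt}"
        for s, nxt in failures
    ]


def revise(ws: Workspace, candidates: List[Candidate]) -> bool:
    """Assumed stand-in for a model-driven rewrite: adopt the first
    candidate rule that is consistent with all evidence so far."""
    for name, fn in candidates:
        if all(fn(s) == nxt for s, nxt in ws.evidence):
            ws.artifact_name, ws.artifact = name, fn
            return True
    return False


def optimize_workspace(ws, env_step, probes, candidates):
    """Probe the environment, record evidence, and revise on counterexamples."""
    for s in probes:
        ws.evidence.append((s, env_step(s)))
        failures = counterexamples(ws)
        if failures:
            ws.feedback_log.extend(textual_feedback(ws, failures))
            revise(ws, candidates)
    return ws


# Hypothetical toy setup: the hidden rule is x -> 2*x + 1.
def hidden_env(x: int) -> int:
    return 2 * x + 1


CANDIDATES: List[Candidate] = [
    ("identity", lambda x: x),
    ("increment", lambda x: x + 1),
    ("double_plus_one", lambda x: 2 * x + 1),
]

ws = optimize_workspace(
    Workspace("identity", lambda x: x), hidden_env, [0, 1, 2], CANDIDATES
)
print(ws.artifact_name)
for line in ws.feedback_log:
    print(line)
```

In this toy run the model weights never appear at all: only the workspace changes. Each counterexample triggers a revision of the artifact, and the loop stops revising once the artifact reproduces all recorded evidence.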