GenClaw: Code-Driven Agentic Image Generation

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image generation agents predominantly rely on black-box models, offering only indirect control over outputs through iterative prompt engineering and lacking precise manipulation of visual content. This work proposes a code-driven paradigm for image generation that integrates large language models, web search, programmatic graphics (SVG/HTML/Three.js), and diffusion models. The approach first uses search and reasoning to build the conceptual context, then generates structured sketches via executable code, and finally adds texture and photorealism using diffusion models. By introducing code as a controllable intermediate representation, the method bridges linguistic reasoning and pixel-level generation, yielding a staged, interpretable creation pipeline intended to give more precise control over image layout, structure, and semantics.
📝 Abstract
Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, Three.js) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
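The three stages described in the abstract (conceptualize, sketch in code, then add texture) can be illustrated with a minimal Python sketch. This is not the GenClaw implementation: the `Shape` schema, `plan_scene`, `render_svg`, `placeholder_refiner`, and `generate` are hypothetical names introduced here, the planning stage is hard-coded where the paper uses search and LLM reasoning, and the refinement stage is a stub standing in for an image generation model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
from xml.sax.saxutils import quoteattr


@dataclass
class Shape:
    """One labeled rectangle in the sketch (hypothetical schema)."""
    label: str
    x: int
    y: int
    width: int
    height: int
    fill: str


def plan_scene(prompt: str) -> List[Shape]:
    """Stage 1 stand-in. A real agent would derive the layout from the
    prompt via search and reasoning; here it is hard-coded (assumption)."""
    return [
        Shape("sky", 0, 0, 400, 200, "#87ceeb"),
        Shape("ground", 0, 200, 400, 100, "#228b22"),
        Shape("house", 150, 140, 100, 80, "#b22222"),
    ]


def render_svg(shapes: List[Shape], width: int = 400, height: int = 300) -> str:
    """Stage 2: emit the sketch as SVG, so layout is explicit and editable."""
    rects = "\n".join(
        f"  <rect id={quoteattr(s.label)} x=\"{s.x}\" y=\"{s.y}\" "
        f"width=\"{s.width}\" height=\"{s.height}\" fill={quoteattr(s.fill)}/>"
        for s in shapes
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">\n{rects}\n</svg>'
    )


def placeholder_refiner(svg: str, prompt: str) -> bytes:
    """Stage 3 stub. A real system would condition an image generation
    model on the rendered sketch; this just returns the SVG bytes."""
    return svg.encode("utf-8")


def generate(
    prompt: str, refiner: Callable[[str, str], bytes]
) -> Tuple[str, bytes]:
    """Run the three stages and return both the sketch and the final output."""
    shapes = plan_scene(prompt)
    svg = render_svg(shapes)
    return svg, refiner(svg, prompt)


if __name__ == "__main__":
    sketch, image_bytes = generate("a red house on a green field", placeholder_refiner)
    print(sketch)
```

Because the sketch is plain code, a change such as moving the house is a one-line edit to its `x` value before refinement, rather than another round of prompt rewriting.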
Problem

Research questions and friction points this paper is trying to address.

image generation
agentic AI
black-box models
visual control
prompt engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

code-driven
agentic image generation
executable sketches
controllable canvas
multimodal reasoning