🤖 AI Summary
Current large language model (LLM) agents predominantly encode skills as informal prompts, which offer no built-in support for workflow state, execution policies, or completion criteria and often leave a gap between reasoning and action. This work proposes Formal Skill, a runtime-native, reusable capability abstraction that encodes skills as executable state machines. Each skill declares its interface with JSON Schema, implements its action logic in a Python executor, and adds event-driven hooks and skill-local state management. The approach marks a systematic shift from prompt-based skill descriptions to stateful, policy-constrained executable units, which improves skill composability, observability, and policy enforcement. On Harness-Bench, the method achieves highly competitive scores while using substantially fewer tokens, and it performs especially well on tasks that require structured skill dependencies.
📝 Abstract
Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.
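To make the components named in the abstract concrete, the following is a minimal sketch of how a skill with action schemas, Python executors, a pre-action hook, and skill-local state might fit together. It is not FairyClaw's API: `FormalSkill`, `SKILL_META`, `TRANSITIONS`, the `file_edit` skill, and the `workspace_only` hook are all hypothetical names chosen for illustration, and the schema check is a simplified stand-in for full JSON Schema validation.

```python
from dataclasses import dataclass, field

# Hypothetical skill metadata with JSON-Schema-style action schemas.
# Every name in this sketch is illustrative, not FairyClaw's actual API.
SKILL_META = {
    "name": "file_edit",
    "actions": {
        "open": {"type": "object", "required": ["path"]},
        "patch": {"type": "object", "required": ["text"]},
        "close": {"type": "object", "required": []},
    },
}

# Skill-local state machine: (current state, action) -> next state.
TRANSITIONS = {
    ("idle", "open"): "editing",
    ("editing", "patch"): "editing",
    ("editing", "close"): "done",
}


@dataclass
class FormalSkill:
    meta: dict
    transitions: dict
    executors: dict                      # action name -> callable(local, args)
    pre_hooks: list = field(default_factory=list)
    state: str = "idle"
    local: dict = field(default_factory=dict)   # skill-local runtime state
    log: list = field(default_factory=list)     # observable action trace

    def call(self, action, args):
        schema = self.meta["actions"].get(action)
        if schema is None:
            return self._reject(action, "unknown action")
        # Simplified schema check: only required keys are verified.
        missing = [k for k in schema["required"] if k not in args]
        if missing:
            return self._reject(action, f"missing arguments: {missing}")
        next_state = self.transitions.get((self.state, action))
        if next_state is None:
            return self._reject(action, f"not allowed in state '{self.state}'")
        # Hooks enforce policy in code; a non-empty return value vetoes the action.
        for hook in self.pre_hooks:
            reason = hook(self, action, args)
            if reason:
                return self._reject(action, reason)
        result = self.executors[action](self.local, args)
        self.state = next_state
        self.log.append((action, "ok"))
        return {"ok": True, "state": self.state, "result": result}

    def _reject(self, action, reason):
        self.log.append((action, "rejected"))
        return {"ok": False, "state": self.state, "error": reason}

    @property
    def complete(self):
        # Completion is a state of the machine, not a sentence in a prompt.
        return self.state == "done"


def do_open(local, args):
    local["path"] = args["path"]
    local["buffer"] = []
    return f"opened {args['path']}"


def do_patch(local, args):
    local["buffer"].append(args["text"])
    return len(local["buffer"])


def do_close(local, args):
    return "\n".join(local["buffer"])


def workspace_only(skill, action, args):
    # Hypothetical policy hook: refuse to open absolute paths.
    if action == "open" and args["path"].startswith("/"):
        return "policy: path outside workspace"
    return None


def make_skill():
    executors = {"open": do_open, "patch": do_patch, "close": do_close}
    return FormalSkill(SKILL_META, TRANSITIONS, executors, pre_hooks=[workspace_only])
```

In this sketch, calling `patch` before `open` is rejected by the transition table rather than by prompt wording, and the model only needs to see the compact action schemas instead of a long procedural document, which is the kind of token saving and enforceability the abstract describes.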