AI Summary
Current agent skills lack a unified framework for evaluating both utility and security. This work proposes a standardized comparative assessment methodology that measures skill effectiveness and risk holistically through paired execution comparisons, standalone security probes, normalized output artifacts, and multidimensional scoring. Grounded in a comparative utility principle and a user-facing simplicity principle, the approach yields distinct utility and security scores along with a three-level security status label, enabling comparable quality evaluation across skills. The methodology is operationalized in an open public service, skilltester.ai, which supports automated, standardized evaluation of AI agent skills and thereby provides infrastructure for trustworthy AI applications.
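To make the normalized output contract concrete, here is a minimal sketch of the evaluation record as a small Python type. All names here (`SkillReport`, `SecurityStatus`, and the `pass`/`warn`/`fail` tiers) are illustrative assumptions, not SkillTester's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class SecurityStatus(Enum):
    """Three-level security status label; tier names are assumed
    for illustration and may differ from SkillTester's labels."""
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class SkillReport:
    """One normalized evaluation artifact per assessed skill."""
    skill_id: str
    utility_score: float       # lift over the no-skill baseline
    security_score: float      # aggregated from standalone security probes
    status: SecurityStatus
```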
Abstract
This technical report presents SkillTester, a tool for evaluating the utility and security of agent skills. Its evaluation framework combines paired baseline and with-skill execution conditions with a separate security probe suite. Grounded in a comparative utility principle and a user-facing simplicity principle, the framework normalizes raw execution artifacts into a utility score, a security score, and a three-level security status label. More broadly, it can be understood as a comparative quality-assurance harness for agent skills in an agent-first world. The public service is deployed at https://skilltester.ai, and the broader project is maintained at https://github.com/skilltester-ai/skilltester.
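A minimal sketch of the paired-execution flow described above is shown below, reusing the `SkillReport` and `SecurityStatus` types from the earlier sketch. The function names, scorer signatures, and status thresholds are all assumptions for illustration; SkillTester's actual pipeline and scoring rules are not specified here.

```python
from statistics import mean
from typing import Callable, Sequence

# Assumes SkillReport and SecurityStatus from the earlier sketch.

Agent = Callable[[str], str]          # hypothetical: prompt -> transcript
Scorer = Callable[[str, str], float]  # hypothetical: (prompt, transcript) -> score in [0, 1]


def evaluate_skill(
    skill_id: str,
    baseline_agent: Agent,
    skilled_agent: Agent,
    tasks: Sequence[str],
    probes: Sequence[str],
    task_scorer: Scorer,
    probe_scorer: Scorer,
) -> SkillReport:
    """Run each task twice (without and with the skill) and take the
    mean score lift as utility; run security probes standalone against
    the skill-enabled agent and aggregate them into a security score."""
    utility = mean(
        task_scorer(t, skilled_agent(t)) - task_scorer(t, baseline_agent(t))
        for t in tasks
    )
    security = mean(probe_scorer(p, skilled_agent(p)) for p in probes)

    # Thresholds below are invented for this sketch, not SkillTester's rules.
    if security >= 0.9:
        status = SecurityStatus.PASS
    elif security >= 0.6:
        status = SecurityStatus.WARN
    else:
        status = SecurityStatus.FAIL

    return SkillReport(skill_id, utility, security, status)
```

Keeping the utility measurement comparative (a paired lift rather than an absolute score) is what makes results meaningful across skills with very different task difficulty.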