How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language models (LLMs) for skill utilization often rely on idealized assumptions that fail to capture their ability to autonomously retrieve and apply skills in real-world settings. This work introduces the first evaluation framework designed for realistic conditions, leveraging a corpus of 34k authentic skills to systematically assess LLMs’ task-completion efficacy using non-curated skill sets. The study reveals that under highly challenging scenarios, the performance gains conferred by skill augmentation nearly vanish. To address this limitation, the authors propose a query-specific skill refinement strategy that substantially restores model performance—demonstrated by an increase in Claude Opus 4.6’s pass rate from 57.7% to 65.5% on Terminal-Bench 2.0.
📝 Abstract
Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill usage remain scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored skills for each task, whereas in many realistic settings, the LLM agent must search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored to the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
Problem

Research questions and friction points this paper is trying to address.

LLM agents
skill usage
realistic settings
benchmarking
skill retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

skill retrieval
LLM agents
realistic benchmarking
skill refinement
query-specific adaptation