🤖 AI Summary
This study addresses the pervasive presence of “code smells” in Model Context Protocol (MCP) tool descriptions—such as missing or ambiguous information—which lead AI agents to select inappropriate tools or pass incorrect parameters, thereby degrading task performance. The work presents the first systematic definition of MCP description smells and introduces a scoring framework based on six descriptive components. Applying this framework in a large-scale empirical analysis of 856 tools across 103 servers reveals that 97.1% of tool descriptions exhibit at least one smell. Enhancing descriptions comprehensively improves task success rates by a median of 5.85 percentage points and partial goal completion rates by 15.12%, albeit at a 67.46% increase in execution steps and with performance regressions in 16.67% of cases. To mitigate this overhead, the authors propose a compact enhancement strategy that substantially reduces token consumption while preserving behavioral reliability, highlighting a critical trade-off between description quality and agent efficiency.
📝 Abstract
The Model Context Protocol (MCP) standardizes how Foundation Model (FM)-based agents interact with external systems by invoking tools. However, to understand a tool's purpose and features, FMs rely on natural-language tool descriptions, making these descriptions a critical component in guiding FMs to select the optimal tool for a given (sub)task and to pass the right arguments to the tool. While defects or smells in these descriptions can misguide FM-based agents, their prevalence and consequences in the MCP ecosystem remain unclear. To address this, we conduct the first large-scale empirical study of 856 tools spread across 103 MCP servers, assessing their description quality and their impact on agent performance. We identify six components of tool descriptions from the literature, develop a scoring rubric utilizing these components, then formalize tool description smells based on this rubric. By operationalizing this rubric through an FM-based scanner, we find that 97.1% of the analyzed tool descriptions contain at least one smell, with 56% failing to state their purpose clearly. While augmenting these descriptions for all components improves task success rates by a median of 5.85 percentage points and improves partial goal completion by 15.12%, it also increases the number of execution steps by 67.46% and regresses performance in 16.67% of cases. These findings highlight a trade-off between agent performance and cost, as well as the context sensitivity of the performance gain. Furthermore, component ablations show that compact variants of different component combinations often preserve behavioral reliability while reducing unnecessary token overhead, enabling more efficient use of the FM context window and lower execution costs.
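To make the idea of a description smell concrete, the sketch below contrasts a terse MCP tool definition with an enhanced one and flags smells with crude keyword heuristics. This is a minimal illustration, not the paper's method: the actual study operationalizes its rubric through an FM-based scanner, and the component names used here (beyond "purpose", which the abstract mentions) are illustrative assumptions.

```python
# Two hypothetical MCP tool definitions (the `name`/`description`/`inputSchema`
# shape follows MCP tool metadata; the tools themselves are made up).
SMELLY_TOOL = {
    "name": "get_data",
    "description": "Gets data.",  # ambiguous purpose, no parameter/return info
    "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
}

ENHANCED_TOOL = {
    "name": "get_data",
    "description": (
        "Retrieve current weather observations for a city. "
        "Parameters: q (string) - city name, e.g. 'Berlin'. "
        "Returns: JSON with temperature and humidity. "
        "Use when the user asks about current weather conditions."
    ),
    "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
}

def detect_smells(tool: dict) -> list[str]:
    """Flag description smells with simple heuristics.

    A stand-in for the paper's FM-based scanner; the smell names and
    thresholds here are invented for illustration.
    """
    desc = tool.get("description", "")
    smells = []
    if len(desc.split()) < 10:          # too short to state a clear purpose
        smells.append("missing-or-terse-purpose")
    if "Parameters" not in desc and tool["inputSchema"]["properties"]:
        smells.append("undocumented-parameters")
    if "Returns" not in desc:           # return value never described
        smells.append("undocumented-return")
    return smells

print(detect_smells(SMELLY_TOOL))    # all three smells fire
print(detect_smells(ENHANCED_TOOL))  # []
```

The enhanced description also hints at the cost side of the paper's trade-off: it consumes several times as many tokens of the FM's context window as the terse one, which is what motivates the compact enhancement variants.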