🤖 AI Summary
This work addresses the unreliability of tool integration, degraded agent behavior, and security risks stemming from semantic distortions or omissions in free-text descriptions of Model Context Protocol (MCP) servers. We propose the first "smell" assessment framework tailored to MCP tool descriptions, integrating software documentation standards with agent interaction requirements into a four-dimensional quality model: accuracy, functionality, information completeness, and conciseness. The framework systematically defines 18 categories of description smells. A large-scale empirical analysis of over ten thousand MCP servers reveals that 73% of descriptions redundantly repeat the tool name, and thousands exhibit incorrect parameter semantics or missing return-value descriptions. Descriptions adhering to our quality criteria are selected by large language models (LLMs) in competitive settings with 72% probability, a 260% improvement over the 20% baseline.
📝 Abstract
The Model Context Protocol (MCP) has rapidly become a de facto standard for connecting LLM-based agents with external tools via reusable MCP servers. In practice, however, server selection and onboarding rely heavily on free-text tool descriptions that are intentionally loosely constrained. Although this flexibility largely ensures the scalability of MCP servers, it also creates a reliability gap: descriptions often misrepresent or omit key semantics, increasing trial-and-error integration, degrading agent behavior, and potentially introducing security risks. To address this, we present the first systematic study of description smells in MCP tool descriptions and their impact on usability. Specifically, we synthesize software/API documentation practices and agentic tool-use requirements into a four-dimensional quality standard (accuracy, functionality, information completeness, and conciseness) covering 18 specific smell categories. Using this standard, we conducted a large-scale empirical study on a carefully constructed dataset of 10,831 MCP servers. We find that description smells are pervasive (e.g., 73% of descriptions repeat the tool name, and thousands have incorrect parameter semantics or missing return descriptions), reflecting a "code-first, description-last" pattern. Through a controlled mutation-based study, we show these smells significantly affect LLM tool selection, with functionality and accuracy having the largest effects (+11.6% and +8.8%, p < 0.001). In competitive settings with functionally equivalent servers, standard-compliant descriptions reach a 72% selection probability (a 260% improvement over the 20% baseline), demonstrating that smell-guided remediation yields substantial practical benefits. We release our labeled dataset and standards to support future work on reliable and secure MCP ecosystems.
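To make the most common smell concrete, here is a minimal, hypothetical sketch. The tool definitions and the heuristic check below are illustrative assumptions, not artifacts from the paper; they show what a "repeated tool name" smell looks like in an MCP-style tool definition (name, description, inputSchema) next to a standard-compliant rewrite:

```python
import re

# Hypothetical MCP tool definition exhibiting common description smells:
# the description merely restates the tool name, the parameter carries
# no semantics, and the return value is undocumented.
smelly_tool = {
    "name": "get_weather",
    "description": "get weather",  # repeats the tool name, adds nothing
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},  # no description
        "required": ["city"],
    },
}

# A standard-compliant rewrite: accurate, functional, complete, concise.
clean_tool = {
    "name": "get_weather",
    "description": (
        "Return the current temperature (Celsius) and conditions "
        "for a given city; errors if the city is unknown."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'Berlin'",
            }
        },
        "required": ["city"],
    },
}

def has_repeated_name_smell(tool: dict) -> bool:
    """Flag descriptions whose words all come from the tool name itself."""
    name_words = set(tool["name"].lower().replace("_", " ").split())
    desc_words = set(re.findall(r"[a-z]+", tool["description"].lower()))
    return desc_words <= name_words  # description adds no new information

print(has_repeated_name_smell(smelly_tool))  # True
print(has_repeated_name_smell(clean_tool))   # False
```

A real detector would cover all 18 smell categories (missing return descriptions, incorrect parameter semantics, etc.); this single word-overlap heuristic is only meant to illustrate how one smell can be checked mechanically.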