🤖 AI Summary
Existing research has not evaluated how well large language models (LLMs) detect security threats in the system logs of malicious or compromised Model Context Protocol (MCP) servers. Method: We address this gap by introducing the first synthetic benchmark for identifying risks in MCP server logs. The benchmark defines a taxonomy of nine risk categories and provides a dataset of 2,421 chat histories for training and 471 queries for evaluation. The logs are generated with ten state-of-the-art LLMs, and detection performance is optimized via supervised fine-tuning (SFT) and Reinforcement Learning from Verifiable Reward (RLVR) using Group Relative Policy Optimization (GRPO). Results: After GRPO training, Llama3.1-8B-Instruct achieves 83% accuracy, 9 percentage points higher than the best-performing large remote model, and RLVR yields a better precision-recall balance than SFT. This work offers a first systematic assessment of the feasibility and limitations of LLM-based security inference directly at the MCP log layer.
📝 Abstract
Large language models (LLMs) demonstrate strong capabilities in solving complex tasks when integrated with external tools. The Model Context Protocol (MCP) has become a standard interface for enabling such tool-based interactions. However, these interactions introduce substantial security concerns, particularly when the MCP server is compromised or untrustworthy. Prior benchmarks primarily focus on prompt injection attacks or analyze the vulnerabilities of LLM-MCP interaction trajectories; limited attention has been given to the underlying system logs associated with malicious MCP servers. To address this gap, we present the first synthetic benchmark for evaluating LLMs' ability to identify security risks from system logs. We define nine categories of MCP server risks and generate 1,800 synthetic system logs using ten state-of-the-art LLMs. These logs are embedded in the return values of 243 curated MCP servers, yielding a dataset of 2,421 chat histories for training and 471 queries for evaluation. Our pilot experiments reveal that smaller models often fail to detect risky system logs, producing many false negatives. Models trained with supervised fine-tuning (SFT) tend to over-flag benign logs, producing more false positives, whereas Reinforcement Learning from Verifiable Reward (RLVR) offers a better precision-recall balance. In particular, after training with Group Relative Policy Optimization (GRPO), Llama3.1-8B-Instruct achieves 83% accuracy, surpassing the best-performing large remote model by 9 percentage points. Fine-grained, per-category analysis further underscores the effectiveness of reinforcement learning in enhancing LLM safety within the MCP framework. Code and data are available at: https://github.com/PorUna-byte/MCP-RiskCue
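To make the RLVR and GRPO terminology above concrete, the following is a minimal sketch of a verifiable reward combined with group-relative advantage normalization, which is the core idea GRPO is named after. It is not the authors' training code. The exact-match reward, the label strings, the function names, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted_label: str, gold_label: str) -> float:
    """Assumed reward: 1.0 if the predicted risk label matches the reference, else 0.0."""
    return 1.0 if predicted_label.strip().lower() == gold_label.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the group sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The population
    standard deviation and the eps value are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers for one chat history.
# "category_a" and "benign" are placeholder labels, not the paper's category names.
gold = "category_a"
sampled_predictions = ["category_a", "benign", "benign", "category_a"]

rewards = [verifiable_reward(p, gold) for p in sampled_predictions]
advantages = group_relative_advantages(rewards)
```

Because each advantage is measured relative to the other samples for the same prompt, a group in which every answer earns the same reward yields zero advantages and therefore contributes no learning signal in this sketch.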