MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research overlooks evaluating large language models' (LLMs') capability to detect security threats from the system logs of malicious or compromised Model Context Protocol (MCP) servers. Method: We bridge this gap by introducing the first synthetic benchmark for MCP server log risk identification—establishing a fine-grained taxonomy of nine risk categories and releasing a high-quality dataset of 2,421 dialogue histories; we generate logs using ten mainstream LLMs and optimize detection performance via supervised fine-tuning (SFT) and Reinforcement Learning from Verifiable Rewards (RLVR) based on Group Relative Policy Optimization (GRPO). Results: After GRPO training, Llama3.1-8B-Instruct achieves 83% accuracy—9 percentage points higher than the best remote LLM—and demonstrates a superior precision–recall trade-off. This work provides the first systematic validation of LLMs' feasibility and limitations in performing security inference directly at the MCP log layer.

📝 Abstract
Large language models (LLMs) demonstrate strong capabilities in solving complex tasks when integrated with external tools. The Model Context Protocol (MCP) has become a standard interface for enabling such tool-based interactions. However, these interactions introduce substantial security concerns, particularly when the MCP server is compromised or untrustworthy. While prior benchmarks primarily focus on prompt injection attacks or analyze the vulnerabilities of LLM–MCP interaction trajectories, limited attention has been given to the underlying system logs associated with malicious MCP servers. To address this gap, we present the first synthetic benchmark for evaluating LLMs' ability to identify security risks from system logs. We define nine categories of MCP server risks and generate 1,800 synthetic system logs using ten state-of-the-art LLMs. These logs are embedded in the return values of 243 curated MCP servers, yielding a dataset of 2,421 chat histories for training and 471 queries for evaluation. Our pilot experiments reveal that smaller models often fail to detect risky system logs, leading to high false negatives, while models trained with supervised fine-tuning (SFT) tend to over-flag benign logs, resulting in elevated false positives. Reinforcement Learning from Verifiable Rewards (RLVR) offers a better precision–recall balance: after training with Group Relative Policy Optimization (GRPO), Llama3.1-8B-Instruct achieves 83% accuracy, surpassing the best-performing large remote model by 9 percentage points. Fine-grained, per-category analysis further underscores the effectiveness of reinforcement learning in enhancing LLM safety within the MCP framework. Code and data are available at: https://github.com/PorUna-byte/MCP-RiskCue
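The "verifiable reward" in RLVR means the training signal can be computed mechanically from the gold annotation, with no learned judge. A minimal sketch of such a reward for this task might look like the following; the output format (`label: <x>`) and the coarse benign/risky label set are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a verifiable reward for GRPO-style training on
# MCP log risk identification. The answer format and label names are
# assumptions for illustration; the paper defines nine fine-grained
# risk categories on top of the benign/risky split.

def verifiable_reward(model_output: str, gold_label: str) -> float:
    """Return 1.0 when the model's predicted label matches the gold
    annotation, else 0.0 -- checkable without any learned reward model."""
    # Assume the model is prompted to end its answer with "label: <x>".
    prediction = None
    for line in model_output.strip().splitlines():
        if line.lower().startswith("label:"):
            prediction = line.split(":", 1)[1].strip().lower()
    return 1.0 if prediction == gold_label.strip().lower() else 0.0
```

In a GRPO loop, this scalar would be computed for each sampled completion in a group, and the group-relative advantages derived from it would drive the policy update.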
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to detect security risks from MCP server system logs
Addressing the limited attention given to the underlying system logs of malicious MCP servers
Creating a synthetic benchmark covering nine categories of MCP server risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic benchmark for LLM risk detection from MCP server logs
Reinforcement learning (GRPO) improves risk-detection accuracy
Fine-grained, per-category analysis of MCP server vulnerabilities
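The paper's headline contrast is between small untrained models (which miss risky logs, i.e. high false negatives) and SFT models (which over-flag benign logs, i.e. high false positives). A short sketch of the metrics behind that trade-off, under the usual binary risky-vs-benign framing (an assumption for illustration, since the benchmark also has nine fine-grained categories):

```python
def prf_accuracy(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, precision, and recall from a binary confusion matrix
    where 'risky' is the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0  # hurt by over-flagging (FP)
    recall = tp / (tp + fn) if tp + fn else 0.0     # hurt by missed risks (FN)
    return accuracy, precision, recall
```

Untrained small models sit in the low-recall corner, SFT models in the low-precision corner; the RLVR result is that GRPO training lands closer to the balanced regime while also raising overall accuracy.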