Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models to malicious knowledge injection during knowledge-intensive reasoning, a risk that existing knowledge editing benchmarks, which focus mainly on editing efficacy, do not systematically evaluate. The authors propose EditRisk-Bench, a unified evaluation framework focused on how adversarial edits compromise downstream reasoning. It covers several malicious scenarios (misinformation, bias, and safety violations) and combines multi-level reasoning tasks with representative knowledge editing methods. Its assessment protocol measures attack effectiveness, reasoning correctness, and side effects. Experiments on open-source and closed-source models show that malicious edits can reliably induce incorrect or unsafe reasoning while largely preserving the model’s general capabilities, which makes the risk hard to detect. Edit scale, knowledge characteristics, and reasoning complexity emerge as key factors influencing the risk.

📝 Abstract

Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.
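To make the three evaluation axes named in the abstract (attack effectiveness, reasoning correctness, and side effects) more concrete, the following is a minimal sketch of how such rates could be tallied from model outputs. It is not the benchmark's actual protocol: the record fields, the exact-match scoring, and the function names `EvalRecord` and `summarize` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated question. All fields are hypothetical, for illustration only."""
    kind: str                 # "target": depends on the edited knowledge; "control": unrelated
    correct_answer: str       # ground-truth answer
    pre_edit_output: str      # model answer before the edit
    post_edit_output: str     # model answer after the edit
    malicious_answer: str = ""  # answer the attacker wants (targets only)


def _match(output: str, answer: str) -> bool:
    # Assumed scoring rule: case-insensitive exact match.
    return output.strip().lower() == answer.strip().lower()


def _rate(hits: int, total: int) -> float:
    return hits / total if total else 0.0


def summarize(records: list[EvalRecord]) -> dict[str, float]:
    targets = [r for r in records if r.kind == "target"]
    controls = [r for r in records if r.kind == "control"]

    # Attack effectiveness: post-edit answer equals the attacker's intended answer.
    attack_hits = sum(_match(r.post_edit_output, r.malicious_answer) for r in targets)

    # Reasoning correctness: post-edit answer is still the ground truth.
    correct_hits = sum(_match(r.post_edit_output, r.correct_answer) for r in targets)

    # Side effects: unrelated questions answered correctly before the edit
    # but incorrectly after it.
    stable_before = [r for r in controls if _match(r.pre_edit_output, r.correct_answer)]
    broken = sum(not _match(r.post_edit_output, r.correct_answer) for r in stable_before)

    return {
        "attack_success_rate": _rate(attack_hits, len(targets)),
        "reasoning_accuracy": _rate(correct_hits, len(targets)),
        "side_effect_rate": _rate(broken, len(stable_before)),
    }
```

Under these assumptions, the pattern the abstract describes would show up as a high `attack_success_rate` together with a low `side_effect_rate`: the edit changes the targeted reasoning while unrelated behavior stays largely intact, which is what makes the attack hard to notice.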

Problem

Research questions and friction points this paper is trying to address.

knowledge editing

safety risks

knowledge-intensive reasoning

malicious knowledge

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge editing

safety benchmark

malicious knowledge injection