🤖 AI Summary
This paper systematically exposes the structural security risks that knowledge editing (KE) techniques pose to large language models (LLMs), framing “model malleability” as a novel AI safety threat. Methodologically, it combines empirical evaluation of mainstream KE methods with attack-surface modeling, AI supply-chain auditing, and socio-technical assessment. The study identifies four core risk vectors: (1) low technical barriers to accessing KE tools, (2) high generalizability of malicious use cases, (3) absence of verification mechanisms in model distribution, and (4) severe institutional lag in regulatory and governance responses. The findings show that KE’s stealthiness, low cost, and the openness of the model ecosystem make it an attractive vehicle for adversarial model tampering and the proliferation of harmful models. The work contributes a tripartite mitigation framework of tamper-resistant model architectures, robust model watermarking, and end-to-end governance protocols, and urges the community to embed defensive design principles throughout the LLM development and deployment lifecycle.
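To make the third risk vector concrete, the sketch below shows one minimal form the missing verification could take: checking a downloaded checkpoint against a publisher-provided digest before loading it. This is an illustration, not the paper's proposed protocol; the file name and `PUBLISHED_DIGEST` constant are placeholders, and a real deployment would also need a trustworthy channel (for example, signed metadata) for distributing the digest itself.

```python
# Minimal sketch of checkpoint integrity checking at download time: one
# possible instance of the verification the summary says model
# distribution currently lacks. Digest and path below are placeholders.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file so multi-GB checkpoints need not fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while blob := f.read(chunk):
            h.update(blob)
    return h.hexdigest()

def verify_checkpoint(path: Path, published_digest: str) -> None:
    actual = sha256_of(path)
    if actual != published_digest:
        # A silently edited model fails here instead of being loaded.
        raise RuntimeError(f"checkpoint digest mismatch: {actual}")

# Usage (hypothetical values):
# verify_checkpoint(Path("model.safetensors"), PUBLISHED_DIGEST)
```

A plain hash only proves the file matches what the publisher posted; it does not prove the publisher's own weights are unedited, which is why the summary pairs verification with watermarking and governance measures.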
📝 Abstract
Large Language Models (LLMs) encode large amounts of factual knowledge about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note that KEs are widely available, computationally inexpensive, highly performant, and stealthy, which makes them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how easily they can be adapted to a variety of harmful purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow updated models to be uploaded and downloaded without restriction or verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and we discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
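To see why such edits are cheap and targeted, consider the rank-one weight update at the heart of locate-and-edit KE methods such as ROME. The toy PyTorch sketch below is an illustration only: the matrix and vectors are random stand-ins for a real model's MLP projection and for the key/value activations that actual KE methods derive from the network, but it shows how a single closed-form update rewrites one association while leaving orthogonal directions untouched.

```python
# Toy sketch (not the paper's method): a ROME-style rank-one edit on one
# linear layer, illustrating why knowledge editing is cheap and targeted.
# Dimensions and vectors are illustrative stand-ins; real KE methods
# locate the target layer and compute k/v from the model's activations.
import torch

torch.manual_seed(0)
d = 64                    # hidden size of the toy layer
W = torch.randn(d, d)     # stand-in for one MLP projection matrix

k = torch.randn(d)        # "key": activation encoding the edited subject
k = k / k.norm()
v_new = torch.randn(d)    # "value": activation encoding the new fact

# Closed-form rank-one update: afterwards W_new @ k == v_new, while
# outputs for directions orthogonal to k are unchanged.
residual = v_new - W @ k
W_new = W + torch.outer(residual, k) / (k @ k)

print(torch.allclose(W_new @ k, v_new, atol=1e-5))   # True: fact rewritten
q = torch.randn(d); q = q - (q @ k) * k              # unrelated direction
print(torch.allclose(W_new @ q, W @ q, atol=1e-4))   # True: limited side effects
```

This shape of update underlies the abstract's claims of low cost and stealth: the edit is a small weight delta applied in closed form rather than a retraining run, so it is fast to compute and hard to spot in the model's overall behavior.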