ScEdit: Script-based Assessment of Knowledge Editing

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing knowledge editing (KE) benchmarks oversimplify evaluation and fail to reflect real-world performance, particularly in LLM-as-agent scenarios that require dynamic, context-aware updates. Method: We introduce ScEdit, the first script-based KE benchmark, covering both counterfactual and time-sensitive edits. It shifts evaluation from "What"-type questions (fact correctness) to "How"-type questions (action feasibility) via script-based modeling, multi-granularity assessment (token-level and text-level), and an agent-behavior-oriented evaluation protocol. Contribution/Results: Experiments show that state-of-the-art KE methods degrade substantially on text-level metrics, exposing fundamental limitations for practical deployment. ScEdit is open-sourced.

📝 Abstract
Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Knowledge Editing in real-world scenarios
Assessing KE methods with script-based benchmarks
Extending fact-based to action-based KE evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces script-based benchmark ScEdit
Combines token and text-level evaluations
Extends evaluation to action-based questions
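
The token-level ("What"-type) versus text-level ("How"-type) distinction can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the function names, the exact-match check, and the concept-coverage proxy are all assumptions; the paper's actual text-level metrics are richer.

```python
def token_level_score(model_answer: str, edited_target: str) -> float:
    """'What'-type check: did the model state the edited fact exactly?"""
    return 1.0 if model_answer.strip().lower() == edited_target.strip().lower() else 0.0


def text_level_score(generated_script: str, required_concepts: list[str]) -> float:
    """'How'-type check: does a generated step-by-step script reflect the edit?

    Approximated here by concept coverage over the script text; ScEdit's
    actual text-level evaluation is more sophisticated.
    """
    text = generated_script.lower()
    hits = sum(1 for concept in required_concepts if concept.lower() in text)
    return hits / len(required_concepts)


# Hypothetical edit: "The Eiffel Tower is in Paris" -> "... is in Rome"
answer = "Rome"
script = "1. Fly to Rome. 2. Take a taxi to the Eiffel Tower. 3. Buy a ticket."
print(token_level_score(answer, "Rome"))             # 1.0
print(text_level_score(script, ["Rome", "ticket"]))  # 1.0
```

A method can score perfectly on the first check while failing the second: recalling the edited fact in isolation does not guarantee the edit propagates into coherent, actionable scripts, which is the gap ScEdit is designed to expose.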
👥 Authors
Xinye Li (Harbin Institute of Technology)
  Topics: Large Language Model, Natural Language Processing, Agent, Knowledge Editing, Interpretability
Zunwen Zheng (Harbin Institute of Technology)
Qian Zhang (Harbin Institute of Technology)
Dekai Zhuang (Jilin University)
Jiabao Kang (Harbin Institute of Technology)
Liyan Xu (WeChat AI, Tencent)
  Topics: Natural Language Processing, Machine Learning
Qingbin Liu (Tencent)
Xi Chen (Tencent)
Zhiying Tu (Harbin Institute of Technology)
  Topics: Software Engineering
Dianhui Chu (Harbin Institute of Technology)
Dianbo Sui (Harbin Institute of Technology)