🤖 AI Summary
This work addresses two key challenges in cross-lingual knowledge editing: weak cross-lingual generalization and the absence of a comprehensive evaluation benchmark. To this end, we introduce BMIKE-53, the first large-scale multilingual benchmark for in-context knowledge editing (IKE). It covers 53 languages and unifies zsRE, CounterFact, and WikiFactDiff under a consistent framework to systematically evaluate cross-lingual generalization in zero-, one-, and few-shot setups, using demonstrations tailored to each evaluation metric. The evaluation identifies three factors that influence editing efficacy: model scale, demonstration alignment, and script type (e.g., Latin vs. non-Latin). Experiments show that larger models and tailored demonstrations significantly improve cross-lingual editing performance, whereas languages written in non-Latin scripts underperform due to issues such as language confusion. This work provides an empirically grounded, reproducible benchmark and actionable insights for advancing multilingual knowledge editing.
📝 Abstract
This paper introduces BMIKE-53, a comprehensive benchmark for cross-lingual in-context knowledge editing (IKE) across 53 languages, unifying three knowledge editing (KE) datasets: zsRE, CounterFact, and WikiFactDiff. Cross-lingual KE, which requires knowledge edited in one language to generalize to other languages while preserving unrelated knowledge, remains underexplored. To address this gap, we systematically evaluate IKE under zero-shot, one-shot, and few-shot setups, incorporating tailored metric-specific demonstrations. Our findings reveal that model scale and demonstration alignment critically govern cross-lingual IKE efficacy, with larger models and tailored demonstrations significantly improving performance. Linguistic properties, particularly script type, strongly influence performance variation across languages, with languages written in non-Latin scripts underperforming due to issues such as language confusion.
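To make the zero-, one-, and few-shot setups concrete, the following is a minimal sketch of how an IKE prompt might be assembled from k demonstrations. It is an illustration only, not the benchmark's actual code: the `Demo` class, the `build_ike_prompt` function, the "New Fact / Question / Answer" template, and the example facts are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    new_fact: str   # edited fact, stated in the source language
    question: str   # query, possibly in a different target language
    answer: str     # answer the demonstration models


def build_ike_prompt(new_fact: str, question: str,
                     demos: List[Demo], k: int) -> str:
    """Assemble an editing prompt with the first k demonstrations.

    k = 0 gives the zero-shot setup, k = 1 one-shot, k > 1 few-shot.
    The prompt template is an assumption, not the paper's format.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    blocks = [
        f"New Fact: {d.new_fact}\nQuestion: {d.question}\nAnswer: {d.answer}"
        for d in demos[:k]
    ]
    # The final block is the actual query; the model completes the answer.
    blocks.append(f"New Fact: {new_fact}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Invented counterfactual examples: the edit is in English, the query in German.
demos = [
    Demo("The capital of France is Lyon.",
         "Was ist die Hauptstadt von Frankreich?", "Lyon"),
    Demo("The Eiffel Tower is located in Rome.",
         "In welcher Stadt steht der Eiffelturm?", "Rom"),
]

zero_shot = build_ike_prompt("The Louvre is located in Berlin.",
                             "In welcher Stadt befindet sich der Louvre?",
                             demos, k=0)
few_shot = build_ike_prompt("The Louvre is located in Berlin.",
                            "In welcher Stadt befindet sich der Louvre?",
                            demos, k=2)
print(few_shot)
```

In a metric-specific setup, one would presumably select a different demonstration pool per metric, for example demonstrations whose answers leave unrelated knowledge unchanged when probing whether unrelated facts are preserved. The selection logic is left out here because the abstract does not specify it.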