🤖 AI Summary
The absence of standardized benchmarks for evaluating large language models’ (LLMs) knowledge memorization capability hinders systematic assessment and improvement. Method: We introduce WikiDYK—the first dynamically evolving, real-world knowledge injection benchmark—built upon Wikipedia’s manually curated “Did You Know” facts, automatically generating over 77K multi-format question-answer pairs. We systematically demonstrate that bidirectional language models (BiLMs) exhibit significantly higher knowledge memorization reliability than causal language models (CLMs), achieving a +23% accuracy gain. Furthermore, we propose a BiLM-augmented modular framework wherein BiLMs serve as plug-in external knowledge repositories, enabling collaborative reasoning with LLMs to overcome single-model capacity limitations. Results: Experiments show that our framework boosts reliability accuracy by up to 29.1%. This work establishes a novel paradigm and a high-quality benchmark for evaluating and enhancing LLMs’ knowledge memorization capabilities.
📝 Abstract
Despite significant advances in large language models (LLMs), their knowledge memorization capabilities remain underexplored due to the lack of a standardized, high-quality testbed. In this paper, we introduce a novel, real-world, large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. Specifically, we propose WikiDYK, which leverages recently added, human-written facts from Wikipedia's "Did You Know..." entries. These entries are carefully selected by expert Wikipedia editors based on criteria such as verifiability and clarity. Each entry is converted into multiple question-answer pairs spanning diverse task formats, from simple cloze prompts to complex multi-hop questions. WikiDYK contains 12,290 facts and 77,180 questions, and it is seamlessly extensible with future updates from Wikipedia editors. Extensive experiments using continued pre-training reveal a surprising insight: despite their prevalence in modern LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge memorization capabilities than Bidirectional Language Models (BiLMs), exhibiting 23% lower reliability accuracy. To compensate for the smaller scale of current BiLMs, we introduce a modular collaborative framework that uses ensembles of BiLMs as external knowledge repositories integrated with LLMs. Experiments show that our framework further improves reliability accuracy by up to 29.1%.
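The fact-to-question conversion described above can be illustrated with a minimal sketch. Note that the function names, data shapes, and the example fact below are illustrative assumptions, not the paper's actual generation pipeline (which the abstract indicates is automatic and produces multiple task formats per entry):

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One question-answer pair derived from a 'Did You Know' fact."""
    question: str
    answer: str
    task_format: str  # e.g. "cloze" or "multi_hop"


def fact_to_cloze(fact: str, answer_span: str) -> QAPair:
    """Turn a DYK-style fact into a cloze-format QA pair by masking
    the answer span. A simplified stand-in for WikiDYK's pipeline."""
    if answer_span not in fact:
        raise ValueError("answer span must appear in the fact")
    question = fact.replace(answer_span, "____", 1)
    return QAPair(question=question, answer=answer_span, task_format="cloze")


# Hypothetical DYK-style fact (not taken from the benchmark):
fact = "the narwhal's tusk is actually an elongated tooth"
pair = fact_to_cloze(fact, "tooth")
print(pair.question)  # the narwhal's tusk is actually an elongated ____
```

In the benchmark itself each fact yields several such pairs across formats, so a real pipeline would emit a list of `QAPair` objects per entry rather than a single cloze question.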