CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Data contamination severely compromises the fairness of LLM evaluation, as test instances may inadvertently overlap with training corpora; existing mitigation methods struggle to simultaneously achieve knowledge erasure and semantic fidelity. To address this, we propose CoreEval—a contamination-resistant, semantically consistent dynamic evaluation framework. Its core innovations include (1) leveraging GDELT’s real-time global event knowledge to instantiate a dynamically updating evaluation corpus, and (2) introducing a “data reflection” module that iteratively refines label consistency via entity-relation extraction, context reconstruction, and semantic distillation. Experiments demonstrate that CoreEval significantly mitigates performance inflation caused by data contamination, improving both accuracy and robustness of LLM assessment across multiple benchmarks. By enabling scalable, adaptive, and contamination-aware evaluation, CoreEval establishes a new paradigm for trustworthy large language model benchmarking.

📝 Abstract
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose **CoreEval**, a **Co**ntamination-**re**silient **Eval**uation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
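The four-stage pipeline the abstract describes (relation extraction → GDELT retrieval → recontextualization → iterative label reflection) can be sketched as follows. All function names and stub logic here are illustrative assumptions, not the authors' implementation; in practice each stage would be backed by an LLM or retrieval system.

```python
# Illustrative sketch of a CoreEval-style update loop. Every helper below is a
# placeholder assumption; the paper's actual components (LLM extractors, GDELT
# retrieval, LLM-judged reflection) are not reproduced here.
from dataclasses import dataclass

@dataclass
class Instance:
    text: str
    label: str

def extract_relations(instance: Instance):
    # Placeholder: a real system would run an entity-relation extractor.
    return [("subject", "relation", "object")]

def retrieve_knowledge(relations):
    # Placeholder for querying GDELT for recent events matching the relations.
    return ["recent event snippet"]

def recontextualize(instance: Instance, knowledge) -> Instance:
    # Placeholder: weave retrieved knowledge into the original text while
    # preserving the task-relevant semantics.
    return Instance(text=instance.text + " " + " ".join(knowledge),
                    label=instance.label)

def labels_consistent(original: Instance, updated: Instance) -> bool:
    # Placeholder: in practice a reflection step checks that the updated
    # instance still supports the original label.
    return updated.label == original.label

def core_eval_update(instance: Instance, max_rounds: int = 3):
    relations = extract_relations(instance)
    knowledge = retrieve_knowledge(relations)
    updated = recontextualize(instance, knowledge)
    # "Data reflection": iterate until the updated instance keeps its label.
    for _ in range(max_rounds):
        if labels_consistent(instance, updated):
            return updated
        updated = recontextualize(instance, knowledge)
    return None  # discard instances whose labels cannot be reconciled

sample = Instance(text="Company A acquired Company B.", label="acquisition")
print(core_eval_update(sample).label)
```

The key design point the abstract emphasizes is the reflection loop: updates are only accepted once the new context and the original label agree, which is what keeps the refreshed dataset semantically consistent with the source benchmark.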
Problem

Research questions and friction points this paper is trying to address.

Addressing data contamination in LLM evaluations for fair assessments
Automatically updating datasets with real-world knowledge to ensure resilience
Preserving semantic complexity while eliminating pre-existing model knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically updates data with real-world knowledge
Extracts entity relationships and retrieves current information
Recontextualizes and refines data for semantic coherence
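The retrieval step above relies on GDELT's continuously updated event coverage. A minimal sketch of how such a query might be parameterized against GDELT's public DOC 2.0 API is shown below; the exact retrieval strategy used by CoreEval is an assumption here, and only the URL construction is shown (no network call).

```python
# Sketch of parameterizing a GDELT DOC 2.0 API query from extracted entities.
# The endpoint and parameter names follow GDELT's public API; how CoreEval
# actually formulates its queries is not specified here and is assumed.
from urllib.parse import urlencode

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def build_gdelt_query(entities, timespan="1m", maxrecords=25):
    """Build an article-list query URL for a set of extracted entities."""
    query = " ".join(f'"{e}"' for e in entities)
    params = {
        "query": query,
        "mode": "artlist",      # return a list of matching articles
        "timespan": timespan,   # e.g. last month, to keep knowledge fresh
        "maxrecords": maxrecords,
        "format": "json",
    }
    return f"{GDELT_DOC_API}?{urlencode(params)}"

url = build_gdelt_query(["OpenAI", "Microsoft"])
print(url)
```

Restricting `timespan` to a recent window is what makes the evaluation corpus dynamic: retrieved events postdate any fixed training cutoff, which is the basis of the contamination-resilience claim.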