CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Data contamination severely compromises the fairness of LLM evaluation, as test instances may inadvertently overlap with training corpora; existing mitigation methods struggle to simultaneously achieve knowledge erasure and semantic fidelity. To address this, we propose CoreEval—a contamination-resistant, semantically consistent dynamic evaluation framework. Its core innovations include (1) leveraging GDELT’s real-time global event knowledge to instantiate a dynamically updating evaluation corpus, and (2) introducing a “data reflection” module that iteratively refines label consistency via entity-relation extraction, context reconstruction, and semantic distillation. Experiments demonstrate that CoreEval significantly mitigates performance inflation caused by data contamination, improving both accuracy and robustness of LLM assessment across multiple benchmarks. By enabling scalable, adaptive, and contamination-aware evaluation, CoreEval establishes a new paradigm for trustworthy large language model benchmarking.

📝 Abstract
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose **CoreEval**, a **Co**ntamination-**re**silient **Eval**uation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is employed to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
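The four-stage pipeline the abstract describes (relation extraction → GDELT retrieval → recontextualization → iterative label reflection) can be sketched as follows. All function names and stub logic here are illustrative assumptions, not the authors' implementation; in practice each stage would be backed by an LLM or retrieval system.

```python
# Illustrative sketch of a CoreEval-style update loop. Every helper below is a
# placeholder assumption; the paper's actual components (LLM extractors, GDELT
# retrieval, LLM-judged reflection) are not reproduced here.
from dataclasses import dataclass

@dataclass
class Instance:
    text: str
    label: str

def extract_relations(instance: Instance):
    # Placeholder: a real system would run an entity-relation extractor.
    return [("subject", "relation", "object")]

def retrieve_knowledge(relations):
    # Placeholder for querying GDELT for recent events matching the relations.
    return ["recent event snippet"]

def recontextualize(instance: Instance, knowledge) -> Instance:
    # Placeholder: weave retrieved knowledge into the original text while
    # preserving the task-relevant semantics.
    return Instance(text=instance.text + " " + " ".join(knowledge),
                    label=instance.label)

def labels_consistent(original: Instance, updated: Instance) -> bool:
    # Placeholder: in practice a reflection step checks that the updated
    # instance still supports the original label.
    return updated.label == original.label

def core_eval_update(instance: Instance, max_rounds: int = 3):
    relations = extract_relations(instance)
    knowledge = retrieve_knowledge(relations)
    updated = recontextualize(instance, knowledge)
    # "Data reflection": iterate until the updated instance keeps its label.
    for _ in range(max_rounds):
        if labels_consistent(instance, updated):
            return updated
        updated = recontextualize(instance, knowledge)
    return None  # discard instances whose labels cannot be reconciled

sample = Instance(text="Company A acquired Company B.", label="acquisition")
print(core_eval_update(sample).label)
```

The key design point the abstract emphasizes is the reflection loop: updates are only accepted once the new context and the original label agree, which is what keeps the refreshed dataset semantically consistent with the source benchmark.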
Problem

Research questions and friction points this paper is trying to address.

Addressing data contamination in LLM evaluations for fair assessments
Automatically updating datasets with real-world knowledge to ensure resilience
Preserving semantic complexity while eliminating pre-existing model knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically updates data with real-world knowledge
Extracts entity relationships and retrieves current information
Recontextualizes and refines data for semantic coherence
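The retrieval step above relies on GDELT's continuously updated event coverage. A minimal sketch of how such a query might be parameterized against GDELT's public DOC 2.0 API is shown below; the exact retrieval strategy used by CoreEval is an assumption here, and only the URL construction is shown (no network call).

```python
# Sketch of parameterizing a GDELT DOC 2.0 API query from extracted entities.
# The endpoint and parameter names follow GDELT's public API; how CoreEval
# actually formulates its queries is not specified here and is assumed.
from urllib.parse import urlencode

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def build_gdelt_query(entities, timespan="1m", maxrecords=25):
    """Build an article-list query URL for a set of extracted entities."""
    query = " ".join(f'"{e}"' for e in entities)
    params = {
        "query": query,
        "mode": "artlist",      # return a list of matching articles
        "timespan": timespan,   # e.g. last month, to keep knowledge fresh
        "maxrecords": maxrecords,
        "format": "json",
    }
    return f"{GDELT_DOC_API}?{urlencode(params)}"

url = build_gdelt_query(["OpenAI", "Microsoft"])
print(url)
```

Restricting `timespan` to a recent window is what makes the evaluation corpus dynamic: retrieved events postdate any fixed training cutoff, which is the basis of the contamination-resilience claim.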