🤖 AI Summary
Large language models (LLMs) suffer from inflated evaluation scores due to pretraining data contamination, necessitating debiased and interpretable assessment methods. To address this, we propose AdEval—a knowledge-aligned dynamic evaluation framework. First, it extracts core concepts from static datasets via knowledge graph construction. Second, grounded in Bloom’s taxonomy of cognitive domains, it dynamically generates knowledge-aligned questions spanning six hierarchical competencies. Third, it integrates online retrieval augmentation to enhance sample quality and interpretability, coupled with adaptive sampling under controllable complexity constraints. AdEval introduces the first “knowledge-concept–question” dynamic alignment mechanism, enabling fine-grained, cross-cognitive-level assessment. Extensive experiments across multiple benchmarks demonstrate that AdEval significantly mitigates data contamination effects, thereby improving evaluation fairness, reliability, and transparency.
📝 Abstract
As Large Language Models (LLMs) are pretrained on massive-scale corpora, data contamination has become an increasingly severe problem, leading to potential overestimation of model performance during evaluation. To address this, we propose AdEval (Alignment-based Dynamic Evaluation), a dynamic data evaluation method aimed at mitigating the impact of data contamination on evaluation reliability. AdEval extracts key knowledge points and main ideas so that dynamically generated questions stay aligned with the core concepts of the static data. It also leverages online search to provide detailed explanations of the related knowledge points, thereby producing high-quality evaluation samples with robust knowledge support. Furthermore, AdEval incorporates mechanisms to control the number and complexity of generated questions, enabling dynamic alignment and flexible adjustment: the generated questions match the complexity of the static data while still supporting evaluation at varied difficulty levels. Based on Bloom's taxonomy, AdEval conducts a multi-dimensional evaluation of LLMs across six cognitive levels: remembering, understanding, applying, analyzing, evaluating, and creating. Experimental results on multiple datasets demonstrate that AdEval effectively reduces the impact of data contamination on evaluation outcomes, enhancing both the fairness and reliability of the evaluation process.
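The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`extract_knowledge_points`, `generate_questions`), the naive sentence-split extractor, and the template prompts are all assumptions made here for clarity; the actual system uses LLM-based extraction and online retrieval augmentation.

```python
# Illustrative sketch of the AdEval pipeline stages (hypothetical names,
# not the paper's actual code).

# The six cognitive levels of Bloom's taxonomy used by AdEval.
BLOOM_LEVELS = ["remembering", "understanding", "applying",
                "analyzing", "evaluating", "creating"]


def extract_knowledge_points(static_sample: str) -> list[str]:
    # Placeholder extractor: in AdEval this step identifies key knowledge
    # points and main ideas from static data; here we simply split into
    # sentences to keep the sketch self-contained.
    return [s.strip() for s in static_sample.split(".") if s.strip()]


def generate_questions(points: list[str], n_per_level: int = 1) -> list[dict]:
    # Dynamically generate knowledge-aligned questions, one (or more) per
    # Bloom level, with a controllable question count per level.
    questions = []
    for level in BLOOM_LEVELS:
        for point in points[:n_per_level]:
            questions.append({
                "level": level,
                "knowledge_point": point,
                "prompt": f"[{level}] Question about: {point}",
            })
    return questions


# Example: one static sample yields six questions, one per cognitive level.
sample = "Photosynthesis converts light into chemical energy. It occurs in chloroplasts."
qs = generate_questions(extract_knowledge_points(sample))
```

Because the questions are generated from extracted concepts rather than copied from the static benchmark, a model that memorized the benchmark during pretraining gains no direct advantage.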