Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current reinforcement learning (RL)-based post-training of large language models (LLMs) lacks effective methods to detect data contamination, i.e., the unintended presence of evaluation data in the training set. Method: We propose Self-Critique, the first dedicated contamination detection framework for this stage. It leverages entropy collapse in model outputs, a signature of policy convergence, and combines self-critical probing with strategy-distribution comparison to sensitively identify contaminated samples. Contribution/Results: To enable rigorous evaluation, we introduce RL-MIA, a benchmark built to simulate RL-phase contamination. Experiments across multiple LLMs and tasks demonstrate that Self-Critique significantly outperforms existing baselines, achieving up to a 30-percentage-point improvement in AUC. It is the first method to provide reliable, discriminative contamination detection during RL post-training, filling a critical methodological gap in the trustworthy evaluation of LLMs.
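The entropy-collapse signal the summary describes can be illustrated with a small, self-contained sketch: a prompt whose sampled completion has unusually low average token entropy is a candidate for having been seen during RL training. This is an illustrative reading only; `entropy_collapse_score`, the 0.5-nat threshold, and the toy distributions are assumptions, not the paper's actual scoring rule.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_output_entropy(step_distributions):
    """Average per-step entropy over a sampled completion."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)

def entropy_collapse_score(step_distributions, threshold=0.5):
    """
    Hypothetical contamination signal: near-deterministic outputs
    (mean entropy below `threshold` nats) suggest the RL policy has
    collapsed onto a single memorized path for this prompt.
    Returns (mean_entropy, flagged).
    """
    h = mean_output_entropy(step_distributions)
    return h, h < threshold

# Toy case: a collapsed (peaked) vs. a diffuse policy over a 4-token vocab.
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 5
diffuse = [[0.25, 0.25, 0.25, 0.25]] * 5

h_c, flag_c = entropy_collapse_score(collapsed)  # low entropy -> flagged
h_d, flag_d = entropy_collapse_score(diffuse)    # max entropy -> not flagged
```

In a real detector the per-step distributions would come from the model's logits at sampling time; the uniform/peaked vectors here only stand in for that signal.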

📝 Abstract
Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of contamination detection in the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
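One plausible operationalization of the abstract's "self-critical probing" is a divergence test: elicit the model's distribution over solution strategies before and after a self-critique prompt, and flag samples where the distribution barely moves (the collapsed policy refuses to deviate). The sketch below is purely illustrative; the `kl_divergence` score, the 0.05-nat threshold, and the three-strategy support are assumptions, not the paper's actual procedure.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats over a shared discrete support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical "strategy distributions": how often the model chooses each
# of 3 solution strategies, before vs. after a self-critique prompt.
original = [0.90, 0.05, 0.05]       # collapsed onto one strategy
post_critique = [0.88, 0.07, 0.05]  # barely moves under critique

divergence = kl_divergence(original, post_critique)
# Small shift under critique -> suspected RL-phase contamination.
flagged = divergence < 0.05
```

An uncontaminated model, by contrast, would be expected to redistribute noticeably across strategies when asked to critique itself, yielding a larger divergence.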
Problem

Research questions and friction points this paper is trying to address.

Detecting data contamination during RL post-training for LLMs
Addressing policy collapse causing output entropy reduction
Developing specialized contamination detection for RL fine-tuning phase
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects contamination via output entropy collapse analysis
Proposes Self-Critique method probing policy collapse patterns
Introduces RL-MIA benchmark for RL contamination simulation
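RL-MIA evaluates detectors by AUC, the standard metric for membership-inference attacks. As context for the reported "up to 30%" improvement, here is a minimal stdlib sketch of the rank-based (Mann-Whitney) AUC: the probability that a randomly chosen contaminated ("member") sample scores higher than a clean one. The scores and labels are toy values, not the paper's data.

```python
def auc_from_scores(scores, labels):
    """
    Rank-based AUC: fraction of (member, non-member) pairs where the
    member receives the higher detector score; ties count as half.
    `labels` uses 1 for contaminated (member) and 0 for clean samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy detector output: members (seen during RL) get higher collapse scores.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_from_scores(scores, labels)  # perfect separation
```

An AUC of 0.5 corresponds to the random-guess regime the abstract attributes to existing baselines on RL-phase contamination; 1.0 is perfect separation.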