Detecting Distillation Data from Reasoning Models

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address benchmark contamination and inflated performance caused by evaluation-data leakage during reasoning-model distillation, this paper proposes Token Probability Deviation (TBD), a detection method that quantifies how far generated token probabilities deviate from a high reference probability in order to determine whether a distilled model was exposed to evaluation-set questions, enabling efficient detection even when only part of the distillation data is available. TBD operates directly on the model's output token probabilities and requires neither access to the original training data nor knowledge of the distillation process. On the S1 dataset, TBD achieves an AUC of 0.918 and a TPR@1%FPR of 0.470, substantially outperforming existing baselines. This work introduces the first probability-deviation-based approach for tracing distillation-data provenance, establishing a scalable, low-overhead detection paradigm for mitigating benchmark contamination.

📝 Abstract
Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method, Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. The key idea behind TBD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1%FPR of 0.470 on the S1 dataset.
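The scoring idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact formula: the reference probability `p_ref`, the zero-clipping of the gap, and the averaging over tokens are all assumptions made here for clarity.

```python
import numpy as np

def tbd_score(token_probs, p_ref=0.99):
    """Illustrative Token Probability Deviation score.

    Measures the average gap between each generated token's probability
    and a high reference probability p_ref (gaps are clipped at zero).
    Lower scores indicate near-deterministic generation, the pattern the
    paper associates with questions seen during distillation.
    """
    probs = np.asarray(token_probs, dtype=float)
    return float(np.mean(np.maximum(p_ref - probs, 0.0)))

# Seen question: near-deterministic tokens yield a small deviation score.
seen_score = tbd_score([0.99, 0.98, 0.995, 0.97])
# Unseen question: more low-probability tokens yield a larger score.
unseen_score = tbd_score([0.95, 0.60, 0.40, 0.85])
assert seen_score < unseen_score
```

Thresholding this score (low score implies "seen") is what turns the probability pattern into a detector; the AUC and TPR@1%FPR numbers above measure how well the scores separate the two classes.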
Problem

Research questions and friction points this paper is trying to address.

Detecting distillation data to prevent benchmark contamination
Identifying seen versus unseen questions in reasoning models
Quantifying token probability deviations for contamination detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects distillation data via token probability deviation
Quantifies deviation from high reference probability
Distinguishes seen questions using lower deviation scores
Hengxiang Zhang
Department of Statistics and Data Science, Southern University of Science and Technology
Hyeong Kyu Choi
University of Wisconsin-Madison, Korea University
Agentic AI · Reliable AI · AI Safety · Computer Vision
Yixuan Li
Department of Computer Sciences, University of Wisconsin–Madison
Hongxin Wei
Southern University of Science and Technology (SUSTech)
Reliable Machine Learning · Uncertainty Estimation · Statistics