🤖 AI Summary
This work addresses semantic-level data contamination in the evaluation of code large language models (LLMs), where traditional methods fail to detect samples that are semantically similar but not exact duplicates. The authors propose TRACER, a framework built on a fine-grained three-tier semantic taxonomy of code contamination (Functionally Identical, Nearly Identical, and Shared Logic), and construct the first benchmark dataset dedicated to fine-grained code contamination detection. TRACER applies a semantics-aware multi-level matching strategy within a coarse-to-fine detection pipeline, leveraging large language model embeddings for semantic code comparison. With GPT-5 as the backbone, TRACER reaches an F1 score of 0.91 in fine-grained detection; in binary classification it attains an F1 of 0.92, outperforming existing methods by 42% to 217%.
📝 Abstract
Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a semantic-aware framework for fine-grained code contamination detection. TRACER models contamination using three levels of semantic overlap (Functionally Identical, Nearly Identical, and Shared Logic) and detects them through a coarse-to-fine pipeline. We also introduce the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER achieves strong and consistent performance across multiple LLM backbones, with GPT-5 reaching an F1 score of 0.91 in fine-grained detection. In the binary setting, TRACER attains an F1 of 0.92, outperforming existing methods by 42% to 217%. We further conduct ablation studies and error analysis to assess the contributions of individual components in TRACER.
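To make the coarse-to-fine idea concrete, the following is a minimal Python sketch, not TRACER's actual implementation. It assumes a cheap lexical filter (token-set Jaccard similarity) as the coarse stage and a pluggable judge as the fine stage. The names `coarse_candidates`, `toy_judge`, and `detect`, and all thresholds, are hypothetical. `toy_judge` is only a lexical stand-in for the semantic comparison that TRACER performs with an LLM backbone.

```python
import re
from enum import Enum
from typing import Callable, List, Set, Tuple


class Level(Enum):
    """The three contamination levels named in the abstract, plus a clean label."""
    FUNCTIONALLY_IDENTICAL = "Functionally Identical"
    NEARLY_IDENTICAL = "Nearly Identical"
    SHARED_LOGIC = "Shared Logic"
    CLEAN = "Not Contaminated"


def tokens(code: str) -> Set[str]:
    """Lowercased identifier and number tokens: a crude lexical fingerprint."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]; 0.0 when both inputs are empty."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def coarse_candidates(item: str, corpus: List[str],
                      top_k: int = 3, min_sim: float = 0.2) -> List[Tuple[float, str]]:
    """Coarse stage (assumed): keep the top_k training samples above min_sim."""
    scored = [(jaccard(item, sample), sample) for sample in corpus]
    scored = [pair for pair in scored if pair[0] >= min_sim]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def toy_judge(item: str, sample: str) -> Level:
    """Hypothetical fine-stage judge. A real system would ask an LLM to compare
    semantics; this stand-in only thresholds lexical overlap."""
    sim = jaccard(item, sample)
    if sim >= 0.9:
        return Level.FUNCTIONALLY_IDENTICAL
    if sim >= 0.6:
        return Level.NEARLY_IDENTICAL
    if sim >= 0.3:
        return Level.SHARED_LOGIC
    return Level.CLEAN


def detect(item: str, corpus: List[str],
           judge: Callable[[str, str], Level]) -> Level:
    """Run the coarse filter, judge each survivor, and return the strongest level."""
    labels = {judge(item, sample) for _, sample in coarse_candidates(item, corpus)}
    for level in (Level.FUNCTIONALLY_IDENTICAL,
                  Level.NEARLY_IDENTICAL,
                  Level.SHARED_LOGIC):
        if level in labels:
            return level
    return Level.CLEAN


def is_contaminated(level: Level) -> bool:
    """Binary setting: any of the three levels counts as contaminated."""
    return level is not Level.CLEAN
```

For example, `detect("def add(a, b): return a + b", ["def add(x, y): return x + y"], toy_judge)` returns `Level.SHARED_LOGIC`, even though the two functions compute the same result. This shows why a purely lexical judge is insufficient and why the fine stage calls for semantic comparison: renaming the variables is enough to hide a functionally identical sample.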