Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the compliance risk that large language models (LLMs) reproduce copyrighted code fragments from their training data without attribution, a problem made harder because existing fingerprint-based provenance methods search in linear time and do not scale to billion-scale corpora. The paper introduces HYBRIDSOURCETRACKER (HST), a two-stage hybrid pipeline that combines vector retrieval with classical Winnowing fingerprints. It first uses SOURCETRACKER, a 300M-parameter code encoder, to recall a small set of candidates, then re-ranks those candidates with Winnowing fingerprints. On a 100k-snippet search space with adapted queries, HST matches the mean reciprocal rank of pure Winnowing for 30-token fragments and exceeds it by up to 5.4% for windows of 60 tokens or more, while preserving logarithmic-time query complexity. A complementary LLM-judge evaluation finds that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows.

📝 Abstract

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but they must compare each code fragment against the entire training set, and this linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
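
The two-stage idea described in the abstract (vector recall, then Winnowing re-ranking, scored by mean reciprocal rank) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the `embed` function is a toy hashed bag-of-tokens vector standing in for the SOURCETRACKER encoder, stage 1 is a brute-force scan rather than an approximate nearest-neighbor index, and the names `tokenize`, `winnow`, `hybrid_search`, and `mean_reciprocal_rank`, along with the parameters `k`, `w`, `dim`, and `top_k`, are hypothetical.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(code):
    """Split code into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def winnow(tokens, k=4, w=4):
    """Winnowing: hash every k-gram, keep the minimum hash of each window of w hashes."""
    hashes = [
        zlib.crc32(" ".join(tokens[i:i + k]).encode())
        for i in range(len(tokens) - k + 1)
    ]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def embed(tokens, dim=64):
    """Toy stand-in for a learned code encoder (assumption): hashed bag of tokens, L2-normalized."""
    vec = [0.0] * dim
    for tok, count in Counter(tokens).items():
        vec[zlib.crc32(tok.encode()) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def jaccard(a, b):
    """Jaccard overlap between two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_search(query, corpus, top_k=3):
    """Return corpus indices: top_k by vector similarity, re-ranked by fingerprint overlap."""
    q_tokens = tokenize(query)
    q_vec, q_fp = embed(q_tokens), winnow(q_tokens)
    # Stage 1: candidate recall. Brute force here; a real system would query a vector index.
    by_vector = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q_vec, embed(tokenize(corpus[i]))),
        reverse=True,
    )
    candidates = by_vector[:top_k]
    # Stage 2: re-rank only the candidates with Winnowing fingerprints.
    return sorted(
        candidates,
        key=lambda i: jaccard(q_fp, winnow(tokenize(corpus[i]))),
        reverse=True,
    )


def mean_reciprocal_rank(rankings, truths):
    """Average of 1/rank of the true source per query; 0 when it is not retrieved."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(rankings)
```

In this sketch the expensive fingerprint comparison runs only over the `top_k` candidates, which is the property that lets the second stage stay cheap regardless of corpus size. The cost of the first stage depends entirely on the vector index used, which the sketch does not model.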

Problem

Research questions and friction points this paper is trying to address.

provenance tracking

code plagiarism

large language models

license compliance

scalable retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

provenance tracking

code retrieval

hybrid search