🤖 AI Summary
N-gram code watermarking—widely adopted for copyright attribution of AI-generated code—exhibits fundamental robustness failures against code obfuscation attacks. Method: We formally model code obfuscation and, under a distribution-consistency assumption, rigorously prove that N-gram watermarks cannot simultaneously satisfy robustness and practicality. Our evaluation spans four obfuscation tools, two large language models, two programming languages, four benchmark datasets, and three state-of-the-art watermarking schemes. Contribution/Results: All detectors degrade to near-random performance post-obfuscation, with AUROC collapsing to ≈0.5; none exceed 0.6. This work systematically exposes the fragility of existing approaches and outlines a potential path toward watermark designs resilient to semantics-preserving transformations—providing both theoretical foundations and practical guidelines for trustworthy AI code provenance.
📝 Abstract
Distinguishing AI-generated code from human-written code is becoming crucial for tasks such as authorship attribution, content tracking, and misuse detection. To this end, N-gram-based watermarking schemes have emerged as a prominent solution: they inject secret watermarks during generation that can later be detected.
However, their robustness on code remains insufficiently evaluated. Most robustness claims rest on resistance to simple code transformations or optimizations used as simulated attacks, creating a questionable sense of security. In contrast, far more sophisticated techniques already exist in software engineering, e.g., code obfuscation, which significantly alters code while preserving its functionality. Although obfuscation is commonly used to protect intellectual property or evade software scanners, the robustness of code watermarking against such transformations remains largely unexplored.
In this work, we formally model code obfuscation and prove that N-gram-based watermarking cannot be robust under a single intuitive, experimentally verified assumption: distribution consistency. Intuitively, once obfuscation makes watermarked code distributionally indistinguishable from unwatermarked code, the detector flags it only at its false positive rate; hence, given a detector with false positive rate fpr, the fraction of watermarked code missed after obfuscation rises to 1 - fpr.
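To make this concrete, here is a toy sketch (not the paper's actual construction) of a KGW-style N-gram watermark: each token is scored "green" by hashing it with its preceding token, watermarked generation always picks green tokens, and detection measures the green fraction. A consistent identifier renaming—a stand-in for real obfuscation, which preserves semantics while rewriting every N-gram—drives the detection statistic back to its unwatermarked baseline. All names (`is_green`, `GAMMA`, the token vocabulary) are illustrative assumptions.

```python
import hashlib
import random

GAMMA = 0.5  # assumed fraction of the vocabulary in each "green list"
VOCAB = [f"tok{i}" for i in range(200)]  # toy token vocabulary

def is_green(prev_tok: str, tok: str) -> bool:
    # Toy N-gram rule (context size 1): hash the (previous token, token)
    # pair to pseudo-randomly decide whether `tok` is on the green list.
    h = hashlib.sha256(f"{prev_tok}|{tok}".encode()).digest()
    return h[0] < 256 * GAMMA

def green_fraction(tokens):
    # Detection statistic: fraction of bigrams whose second token is green.
    pairs = list(zip(tokens, tokens[1:]))
    return sum(is_green(p, t) for p, t in pairs) / len(pairs)

# "Watermarked" generation: always choose a green token for each context.
rng = random.Random(0)
wm = [rng.choice(VOCAB)]
for _ in range(300):
    greens = [t for t in VOCAB if is_green(wm[-1], t)]
    wm.append(rng.choice(greens))

# Semantics-preserving "obfuscation", simulated here as a consistent
# renaming: every token maps to a fresh identifier, so every N-gram the
# detector hashes is new, while program meaning would be unchanged.
mapping = {t: f"obf{i}" for i, t in enumerate(VOCAB)}
obf = [mapping[t] for t in wm]

print(green_fraction(wm))   # 1.0: every bigram was chosen green
print(green_fraction(obf))  # near GAMMA: the unwatermarked baseline
```

Since the post-obfuscation statistic matches the unwatermarked distribution, any threshold calibrated to a false positive rate fpr will flag the obfuscated code with probability only fpr—i.e., miss it with probability 1 - fpr, matching the bound above.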
Our experiments cover three SOTA watermarking schemes, two LLMs, two programming languages, four code benchmarks, and four obfuscators. On obfuscated code, all watermarking detectors perform no better than coin flipping (AUROC tightly clustered around 0.5). Across all models, watermarking schemes, and datasets, each programming language has at least one obfuscator whose attack keeps detection AUROC below 0.6. Based on these theoretical and empirical observations, we also propose a potential path toward robust code watermarking.