🤖 AI Summary
This work addresses a critical security vulnerability in code large language models (CLLMs), where memorization effects can lead to the leakage of developers’ sensitive cryptographic keys. The study introduces, for the first time, the concept of “gibberish bias,” revealing that keys exhibiting high character-level entropy but low token-level entropy—due to Byte-Pair Encoding (BPE) tokenization—are more susceptible to model memorization. Through comprehensive analyses involving BPE tokenization behavior, token entropy quantification, data distribution comparisons, and large-scale empirical evaluations, the authors systematically establish the relationship between token entropy and the ease of key memorization, further demonstrating the pronounced nature of this bias under increasing vocabulary sizes. Building on these insights, the paper proposes tokenizer-aware mitigation strategies, offering a novel direction for enhancing the security of CLLM training.
📝 Abstract
Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Despite the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets because of the well-known memorization phenomenon. This study is the first to reveal that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behavior, which we term “gibberish bias.” Specifically, we find that some secrets are among the easiest for CLLMs to memorize: those with high character-level entropy but low token-level entropy. We support the claim of bias with numerical data and identify its root cause as the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the trend toward larger vocabularies. To conclude, we discuss potential mitigation strategies and the broader implications for current tokenizer design.
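The gap between character-level and token-level entropy can be illustrated with a minimal sketch. It is not the paper's measurement procedure. It assumes that "entropy" means the empirical Shannon entropy of the symbols within a single string, and it replaces a real BPE tokenizer with a hypothetical greedy longest-match tokenizer over a made-up two-entry vocabulary (`TOY_VOCAB`). The secret string is likewise invented for illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (a stand-in for BPE, not real BPE).

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary and secret, chosen only to make the effect visible.
TOY_VOCAB = {"x9Qz", "7Lk2"}
secret = "x9Qz7Lk2x9Qz7Lk2"

tokens = greedy_tokenize(secret, TOY_VOCAB)
char_entropy = shannon_entropy(secret)    # 8 distinct characters, evenly spread
token_entropy = shannon_entropy(tokens)   # only 2 distinct tokens

print(f"tokens: {tokens}")
print(f"character-level entropy: {char_entropy:.2f} bits/char")
print(f"token-level entropy:     {token_entropy:.2f} bits/token")
```

Under these assumptions the string looks random character by character, yet it collapses into a handful of repeated tokens. That is the kind of secret the abstract describes as easy to memorize. With a real tokenizer the vocabulary and the resulting numbers would differ.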