ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Code language models (Code-LMs) suffer from tight syntax–semantics coupling and low sample efficiency during pre-training. Method: the paper grounds pre-training on pairs of source code and its obfuscated counterpart, pushing models to look past surface-form syntax. Contributions/Results: (1) obfuscation grounding, a pre-training objective that encourages semantic invariance under syntactic transformations; (2) ObscuraX, a large-scale multilingual dataset of approximately 55M source–obfuscated code pairs spanning seven languages; (3) ObscuraCoder models from 255M to 2.8B parameters, pre-trained on a 272B-token corpus that includes ObscuraX. The approach outperforms both vanilla autoregressive pre-training and existing de-obfuscation (DOBF) objectives across diverse downstream tasks, including syntactic/semantic code understanding, multilingual code completion, commit message summarization, and library-oriented code generation, while achieving notably higher pre-training sample efficiency.
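
For intuition, below is a minimal sketch of the kind of identifier-level obfuscation (DOBF-style renaming of functions and variables to placeholders such as FUNC_0 and VAR_0) from which a source–obfuscated pair can be built. This is an illustrative Python example, not the ObscuraX pipeline: the placeholder scheme and the AST-based `obfuscate` helper are assumptions made for the sketch.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))  # leave built-ins like print/len untouched

def obfuscate(source: str) -> str:
    """Illustrative obfuscation: rename user-defined functions and variables to
    canonical placeholders (FUNC_0, VAR_0, ...) while leaving syntax intact."""
    tree = ast.parse(source)
    mapping: dict[str, str] = {}
    counters = {"FUNC": 0, "VAR": 0}

    def rename(name: str, kind: str) -> str:
        if name not in mapping:
            mapping[name] = f"{kind}_{counters[kind]}"
            counters[kind] += 1
        return mapping[name]

    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = rename(node.name, "FUNC")
        elif isinstance(node, ast.arg):
            node.arg = rename(node.arg, "VAR")
        elif isinstance(node, ast.Name) and node.id not in BUILTIN_NAMES:
            node.id = rename(node.id, "VAR")
    return ast.unparse(tree)  # requires Python 3.9+

original = "def add(a, b):\n    total = a + b\n    return total\n"
print(obfuscate(original))
# def FUNC_0(VAR_0, VAR_1):
#     VAR_2 = VAR_0 + VAR_1
#     return VAR_2
```

Pairing `original` with `obfuscate(original)` yields one example of the roughly 55M source–obfuscated pairs that ObscuraX collects across seven languages.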

📝 Abstract
Language models (LMs) have become a staple of the code-writing toolbox. Their pre-training recipe has, however, remained stagnant over recent years, barring the occasional changes in data sourcing and filtering strategies. In particular, research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse, especially compared with corresponding efforts in natural language LMs. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency. To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs' abilities compared to both vanilla autoregressive pre-training as well as existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation.
Problem

Research questions and friction points this paper is trying to address.

Improving Code-LMs' pre-training efficiency via obfuscation grounding
Enhancing syntax-semantics disentanglement in code language models
Boosting multilingual code completion and generation capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grounds pre-training on pairs of source code and its obfuscated counterpart to improve sample efficiency (serialization sketched below)
Introduces ObscuraX, a dataset of roughly 55M source–obfuscated code pairs in seven languages
Improves syntactic and semantic code understanding, multilingual code completion and commit summarization, and library-oriented code generation
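
To make the grounding idea concrete, here is a minimal sketch of how one ObscuraX-style pair might be serialized into a single causal-LM training sequence. The `<deobfuscate>` sentinel, the obfuscated-then-original ordering, and the `make_grounding_sample` helper are illustrative assumptions, not the paper's documented recipe.

```python
OBF_SEP = "<deobfuscate>"  # hypothetical sentinel token; the paper's actual special tokens may differ

def make_grounding_sample(obfuscated: str, original: str) -> str:
    """Concatenate an obfuscated snippet with its original source so that a
    causal LM trained on the full string must reconstruct the original from
    the semantics of the obfuscated version, not from its surface identifiers."""
    return f"{obfuscated}\n{OBF_SEP}\n{original}"

# e.g. make_grounding_sample(obfuscate(original), original), reusing the obfuscation sketch above
```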