ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Code language models (Code-LMs) suffer from tight syntax–semantics coupling and low sample efficiency during pre-training. Method: the paper grounds pre-training on pairs of source code and its obfuscated counterpart, pushing models to look past surface-form syntax. Contributions/Results: (1) obfuscation grounding, a pre-training objective that encourages semantic invariance under syntactic transformations; (2) ObscuraX, a large-scale multilingual dataset of approximately 55M source–obfuscated code pairs spanning seven languages; (3) ObscuraCoder models from 255M to 2.8B parameters, pre-trained on a 272B-token corpus that includes ObscuraX. The approach outperforms both vanilla autoregressive pre-training and existing de-obfuscation (DOBF) objectives across diverse downstream tasks, including syntactic/semantic code understanding, multilingual code completion, commit message summarization, and library-oriented code generation, while achieving notably higher pre-training sample efficiency.
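
For intuition, below is a minimal sketch of the kind of identifier-level obfuscation (DOBF-style renaming of functions and variables to placeholders such as FUNC_0 and VAR_0) from which a source–obfuscated pair can be built. This is an illustrative Python example, not the ObscuraX pipeline: the placeholder scheme and the AST-based `obfuscate` helper are assumptions made for the sketch.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))  # leave built-ins like print/len untouched

def obfuscate(source: str) -> str:
    """Illustrative obfuscation: rename user-defined functions and variables to
    canonical placeholders (FUNC_0, VAR_0, ...) while leaving syntax intact."""
    tree = ast.parse(source)
    mapping: dict[str, str] = {}
    counters = {"FUNC": 0, "VAR": 0}

    def rename(name: str, kind: str) -> str:
        if name not in mapping:
            mapping[name] = f"{kind}_{counters[kind]}"
            counters[kind] += 1
        return mapping[name]

    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = rename(node.name, "FUNC")
        elif isinstance(node, ast.arg):
            node.arg = rename(node.arg, "VAR")
        elif isinstance(node, ast.Name) and node.id not in BUILTIN_NAMES:
            node.id = rename(node.id, "VAR")
    return ast.unparse(tree)  # requires Python 3.9+

original = "def add(a, b):\n    total = a + b\n    return total\n"
print(obfuscate(original))
# def FUNC_0(VAR_0, VAR_1):
#     VAR_2 = VAR_0 + VAR_1
#     return VAR_2
```

Pairing `original` with `obfuscate(original)` yields one example of the roughly 55M source–obfuscated pairs that ObscuraX collects across seven languages.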

📝 Abstract
Language models (LMs) have become a staple of the code-writing toolbox. Their pre-training recipe has, however, remained stagnant over recent years, barring the occasional changes in data sourcing and filtering strategies. In particular, research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse, especially compared with corresponding efforts in natural language LMs. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency. To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs' abilities compared to both vanilla autoregressive pre-training as well as existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation.
Problem

Research questions and friction points this paper is trying to address.

Improving Code-LMs' pre-training efficiency via obfuscation grounding
Enhancing syntax-semantics disentanglement in code language models
Boosting multilingual code completion and generation capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grounds pre-training on pairs of source code and its obfuscated counterpart to improve sample efficiency (serialization sketched below)
Introduces ObscuraX, a dataset of roughly 55M source–obfuscated code pairs in seven languages
Improves syntactic and semantic code understanding, multilingual code completion and commit summarization, and library-oriented code generation
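
To make the grounding idea concrete, here is a minimal sketch of how one ObscuraX-style pair might be serialized into a single causal-LM training sequence. The `<deobfuscate>` sentinel, the obfuscated-then-original ordering, and the `make_grounding_sample` helper are illustrative assumptions, not the paper's documented recipe.

```python
OBF_SEP = "<deobfuscate>"  # hypothetical sentinel token; the paper's actual special tokens may differ

def make_grounding_sample(obfuscated: str, original: str) -> str:
    """Concatenate an obfuscated snippet with its original source so that a
    causal LM trained on the full string must reconstruct the original from
    the semantics of the obfuscated version, not from its surface identifiers."""
    return f"{obfuscated}\n{OBF_SEP}\n{original}"

# e.g. make_grounding_sample(obfuscate(original), original), reusing the obfuscation sketch above
```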