CoDA: Coding LM via Diffusion Adaptation

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing diffusion language models, while capable of bidirectional contextual understanding and infilling, suffer from large parameter counts and high deployment costs. To address this, the authors propose CoDA, an open-source 1.7B-parameter diffusion-based coding model. CoDA integrates diffusion pre-training, code-centric mid-training, and instruction fine-tuning, augmented by a confidence-guided sampling mechanism that improves generation quality while keeping inference latency competitive. The model is trained and deployed efficiently using a TPU-optimized framework. On standard code-generation benchmarks (HumanEval, MBPP, and EvalPlus), CoDA matches or surpasses the performance of 7B-parameter diffusion models, demonstrating strong capabilities in program synthesis and context-aware code infilling. The release fully open-sources the model weights, evaluation toolkit, and reproducible training pipeline to foster community advancement.

📝 Abstract
Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On HumanEval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.
Problem

Research questions and friction points this paper is trying to address.

Developing lightweight diffusion language models for code generation
Enabling competitive inference latency with confidence-guided sampling
Providing open-source training pipelines for coding assistant research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight diffusion coder with 1.7B parameters
Combines diffusion pre-training with code-centric mid-training and instruction tuning
Uses confidence-guided sampling for competitive inference
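The confidence-guided sampling mentioned above can be illustrated as a masked-denoising step that commits only the model's highest-confidence predictions and leaves the rest masked for later steps. This is a minimal sketch under assumptions: the `MASK` sentinel, the `keep_fraction` schedule, and the denoiser interface are illustrative, not the paper's actual implementation.

```python
import numpy as np

MASK = -1  # hypothetical mask-token id for this sketch


def confidence_guided_step(tokens, logits, keep_fraction=0.25):
    """One denoising step: unmask only the highest-confidence predictions.

    tokens : (seq_len,) int array, MASK at positions still undecided
    logits : (seq_len, vocab) float array from a (hypothetical) denoiser
    """
    # Softmax over the vocabulary to get per-position confidences.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    preds = probs.argmax(axis=-1)   # most likely token per position
    conf = probs.max(axis=-1)       # its probability = confidence

    masked = np.where(tokens == MASK)[0]
    if masked.size == 0:
        return tokens

    # Commit only the top `keep_fraction` most confident masked positions;
    # the rest stay masked and are revisited in later denoising steps.
    n_keep = max(1, int(np.ceil(keep_fraction * masked.size)))
    order = masked[np.argsort(-conf[masked])]
    out = tokens.copy()
    out[order[:n_keep]] = preds[order[:n_keep]]
    return out
```

Iterating this step until no `MASK` positions remain yields a full sample; because several positions can be committed per step, the number of denoiser calls stays well below the sequence length, which is how such schemes keep latency competitive.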