🤖 AI Summary
Existing diffusion language models offer bidirectional contextual understanding and infilling, but practical systems remain large and costly to deploy. To address this, we propose CoDA, an open-source 1.7B-parameter diffusion coder. CoDA combines diffusion pre-training, code-centric mid-training, and instruction fine-tuning, augmented by a confidence-guided sampling mechanism that improves generation quality while keeping inference latency competitive. The model is trained and deployed efficiently using a TPU-optimized framework. On standard code-generation benchmarks (HumanEval, MBPP, and EvalPlus), CoDA matches or surpasses the performance of 7B-parameter diffusion models, demonstrating strong capabilities in program synthesis and context-aware code infilling. We fully open-source the model weights, evaluation toolkit, and reproducible training pipeline to foster community advancement.
📝 Abstract
Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On HumanEval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.
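To make the confidence-guided sampling idea concrete, here is a minimal sketch of one denoising step in a masked-diffusion decoder: the model scores all masked positions in parallel, and only positions whose top-1 probability clears a threshold are committed, while the rest stay masked for later steps. This is an illustrative reconstruction under common diffusion-LM conventions, not CoDA's actual sampler; the function name, threshold value, and array shapes are assumptions.

```python
import numpy as np

def confidence_guided_step(logits, mask, threshold=0.9):
    """One illustrative denoising step: commit masked positions whose
    top-1 softmax probability exceeds `threshold`.

    logits: (seq_len, vocab) array of model outputs for every position
    mask:   (seq_len,) boolean array, True where a token is still masked
    Returns (greedy predictions, committed-this-step mask, new mask).
    """
    # Numerically stable softmax over the vocabulary at each position
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    conf = probs.max(axis=-1)      # top-1 confidence per position
    pred = probs.argmax(axis=-1)   # greedy token id per position

    # Commit only confident masked positions; the rest remain masked
    commit = mask & (conf >= threshold)
    new_mask = mask & ~commit
    return pred, commit, new_mask

# Tiny worked example: position 0 is a confident prediction,
# position 1 is near-uniform and stays masked for a later step.
logits = np.array([[10.0, 0.0, 0.0],
                   [ 0.1, 0.0, 0.0]])
mask = np.array([True, True])
pred, commit, new_mask = confidence_guided_step(logits, mask)
```

Because the commit decision is a cheap elementwise threshold on probabilities the model already produces, this style of sampler can fill many positions per step without extra forward passes, which is consistent with the paper's claim of competitive inference latency.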