🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in embedded systems code generation, where performance suffers from a lack of specialized knowledge of hardware registers, vendor SDKs, and RTOS APIs. To bridge this gap, the authors propose a domain-specific continual pretraining pipeline built on a 23.5B-token corpus curated from 100B tokens of raw repository and datasheet data spanning 117 vendors. Corpus quality is improved using SpecMap (Nipane et al., 2026), a hierarchical datasheet-to-code mapping technique. The OLMo-3-7B model is then trained with high-rank LoRA (rank 512, BF16) on eight NVIDIA H100 GPUs. Evaluation shows a 70.4% reduction in in-domain perplexity and a 66.1% reduction on held-out repositories. Across code completion benchmarks spanning 13 embedded domains, the resulting 7B open-weight model outperforms both Claude Opus 4.6 and Qwen3-Coder-30B in token accuracy on eight categories.
📝 Abstract
Large language models (LLMs) demonstrate strong code generation abilities in general-purpose programming languages but remain limited in specialized domains such as low-level embedded systems programming. This domain involves hardware register manipulation, vendor-specific SDKs, real-time operating system APIs, and hardware abstraction layers that are underrepresented in standard pretraining corpora. We introduce H2LooP Spark Preview, a continual pretraining (CPT) pipeline that adapts OLMo-3-7B, a fully open language model, to the embedded systems domain using BF16 LoRA with rank-stabilized scaling on 8 NVIDIA H100 GPUs. Our training corpus is constructed from repository-datasheet pairs covering 100B tokens of raw embedded systems data across 117 manufacturers, processed using the hierarchical datasheet-to-code mapping approach proposed in SpecMap (Nipane et al., 2026). The resulting curated dataset split contains 23.5B tokens across 13 embedded domains. Continual pretraining with high-rank LoRA (r=512) yields substantial gains, reducing in-domain perplexity by 70.4% and held-out repository perplexity by 66.1%. On generative code completion benchmarks spanning 13 embedded domains, our 7B model outperforms Claude Opus 4.6 and Qwen3-Coder-30B on 8 categories in token accuracy, showing that targeted continual pretraining enables smaller open-weight models to rival frontier systems on specialized technical tasks. We release the production training checkpoint on Hugging Face as an open-source artifact.
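The key training choice above, high-rank LoRA (r=512) with rank-stabilized scaling, can be illustrated with a minimal sketch. Standard LoRA scales the low-rank update by α/r, which shrinks the effective update as the rank grows; rank-stabilized LoRA (rsLoRA) instead scales by α/√r, keeping update magnitudes stable at high ranks like 512. The class, dimensions, and hyperparameter values below are illustrative assumptions, not details from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A.

    With rank-stabilized scaling (rsLoRA), the update is multiplied by
    alpha / sqrt(r) rather than the standard alpha / r, so the adapter's
    contribution does not vanish as the rank r grows (relevant at r=512).
    Hyperparameters here are illustrative, not the paper's actual config.
    """

    def __init__(self, d_in, d_out, r=512, alpha=512, rslora=True, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.02      # trainable down-projection
        self.B = np.zeros((d_out, r))                       # trainable up-projection, zero-init
        self.scale = alpha / np.sqrt(r) if rslora else alpha / r

    def forward(self, x):
        # y = W x + scale * B (A x); B is zero-initialized, so at the start
        # of training the adapter is a no-op and the base model is unchanged.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(d_in=64, d_out=64)
x = np.ones(64)
y = layer.forward(x)
# rsLoRA scale at r=512, alpha=512: 512/sqrt(512) ~ 22.6, vs 1.0 for standard LoRA
```

With the standard α/r rule, setting α = r gives a scale of exactly 1 regardless of rank; rsLoRA's α/√r instead grows with √r, which is the property that makes very high ranks trainable without the update being damped away.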