🤖 AI Summary
This work addresses the significant performance degradation of existing code large language models in industrial settings, which demand reasoning about hardware semantics, domain-specific language constructs, and strict resource constraints. To bridge this gap, we propose InCoder-32B, the first 32-billion-parameter code foundation model to unify chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. The model is trained from scratch, combining general-purpose code pre-training, curated industrial code annealing, progressive long-context expansion from 8K to 128K tokens, and execution-based post-training. Experimental results show competitive performance across 14 general-purpose benchmarks and establish strong open-source baselines on 9 industrial benchmarks spanning 4 specialized domains.
📝 Abstract
Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. Using an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends the context window from 8K to 128K tokens using synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct an extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show that InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.
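
The abstract does not describe how execution-grounded verification is implemented. As a rough illustration of the general idea only, the sketch below keeps a generated sample when it runs successfully against a set of tests. The function names `passes_tests` and `filter_verified`, the timeout, and the use of plain Python assertions are all assumptions made for this example; they are not taken from the paper, and this is not the authors' pipeline.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate code plus its tests exit with status 0.

    Hypothetical helper: runs the code in a separate Python process so that
    crashes and infinite loops in a candidate cannot take down the caller.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        script.write_text(candidate_src + "\n\n" + test_src, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A candidate that does not terminate in time counts as a failure.
            return False
        return result.returncode == 0


def filter_verified(samples, test_src):
    """Keep only the samples whose execution passes the given tests."""
    return [s for s in samples if passes_tests(s, test_src)]


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    tests = "assert add(2, 3) == 5"
    kept = filter_verified([good, bad], tests)
    print(len(kept))
```

For the industrial domains named in the abstract, the Python subprocess would presumably be replaced by a domain tool such as a simulator or compiler, but the abstract gives no detail on which tools are used.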