🤖 AI Summary
This study evaluates open-weights coding models on a single real-world, multi-file React Native application generation task and finds that their performance diverges sharply from their SWE-Bench rankings. Run on NVIDIA GH200 hardware, the evaluation covers Kimi-K2.5 (Q3 and Q4 quantizations), GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2, and scores each on out-of-the-box usability and feature-level correctness. SWE-Bench scores do not predict task performance: Kimi-K2.5 at 3-bit quantization produces the most complete, specification-compliant output. The study also reports three deployment findings: a default temperature of 0 in coding tools causes sampling hangs with reasoning models, reasoning traces can leak through integration tools' file-path parsers, and every model tested struggles to adapt native-mobile APIs to the web, which the authors attribute to a training-data gap. Finally, it identifies two architectural schools among current open-weights coding models and shows that the “efficiency” school matches the SWE-Bench results of the “scale” school at roughly one-seventh the hardware cost.
📝 Abstract
We evaluate five state-of-the-art open-weights coding model configurations (Kimi-K2.5 at Q3 and Q4 quantizations, GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2) on a single multi-file React Native application generation task on NVIDIA GH200 576 GB hardware. The task specifies authentication, per-user per-day counting, and web compatibility, and is evaluated on whether the generated project runs out-of-the-box and on feature-level correctness. We find that SWE-Bench rankings do not predict task performance: Kimi-K2.5 at aggressive 3-bit quantization (UD-Q3_K_XL, 480 GB) produces the most complete and specification-compliant output, outranking models with substantially higher SWE-Bench Pro scores. We document three novel deployment findings: (1) default temperature=0 in coding tools causes sampling hangs with reasoning-model architectures, (2) reasoning-model thinking traces can leak through integration tools' file-path parsers, and (3) web-platform adaptation of native-mobile APIs is a training-data gap shared by every model tested. We also map the hardware-tier structure of April 2026 open-weights coding models, identifying two architectural schools and showing that the efficiency school (10-15 B active parameters) delivers equivalent SWE-Bench results at roughly 1/7th the hardware cost of the scale school (32-40 B active parameters).
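Deployment findings (1) and (2) suggest two small guards that a coding-tool integration might apply around a reasoning model. The sketch below is illustrative only and is not taken from the paper. It assumes that a small nonzero temperature avoids the hang (the abstract implies this but does not state it) and that the model wraps its reasoning in `<think>...</think>` tags (the abstract does not name the delimiter). The names `safe_temperature`, `strip_thinking`, and `MIN_TEMPERATURE` are hypothetical.

```python
import re

# Assumption: reasoning is delimited by <think>...</think>.
# The abstract does not name the delimiter; adjust for the model in use.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

# Hypothetical floor; the abstract only reports that temperature=0 hangs.
MIN_TEMPERATURE = 0.1


def safe_temperature(requested: float, floor: float = MIN_TEMPERATURE) -> float:
    """Return the requested temperature, raised to `floor` if it is lower."""
    return max(requested, floor)


def strip_thinking(text: str) -> str:
    """Remove complete <think> blocks, then drop any unterminated trailing one."""
    without_closed = THINK_BLOCK.sub("", text)
    # An unclosed <think> (for example in a truncated response) would
    # otherwise reach the file-path parser, so cut everything after it.
    head, _, _ = without_closed.partition("<think>")
    return head.strip()


if __name__ == "__main__":
    raw = "<think>maybe src/tmp.js?</think>\nFILE: App.tsx"
    print(safe_temperature(0.0))
    print(strip_thinking(raw))
```

Running the cleanup before any file-path parsing keeps path-like strings inside the reasoning trace from being mistaken for output files. Whether a floor of 0.1 is sufficient for a given model is an open assumption here.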