Code Generation with Small Language Models: A Deep Evaluation on Codeforces

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of small language models (SLMs) for competitive programming code generation. We present the first comprehensive benchmark of five open-source SLMs on 280 Codeforces problems (Elo 800–2100), spanning 36 algorithmic topics, with dual-language (Python/C++) solution assessment. Methodologically, we propose a fine-grained behavioral analysis framework incorporating Elo-stratified sampling, topic-diversity quantification, qualitative error attribution, and cross-lingual output fusion. Key findings reveal that SLM failures stem primarily from implementation-level inaccuracies—not fundamental reasoning deficits. PHI-4 14B achieves 63.6% pass@3 in Python; integrating its C++ outputs raises performance to 73.6%, approaching the commercial O3-MINI-HIGH (86.8%). This demonstrates SLMs’ practical viability for efficient, low-risk programming tasks—particularly where correctness, latency, and resource efficiency are critical.

📝 Abstract
Large Language Models (LLMs) have demonstrated capabilities in code generation, potentially boosting developer productivity. However, their widespread adoption remains limited by high computational costs, significant energy demands, and security risks such as data leakage and adversarial attacks. As a lighter-weight alternative, Small Language Models (SLMs) offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks, making them an attractive option for real-world applications. While prior research has benchmarked LLMs on competitive programming tasks, such evaluations often focus narrowly on metrics like Elo scores or pass rates, overlooking deeper insights into model behavior, failure patterns, and problem diversity. Furthermore, the potential of SLMs to tackle complex tasks such as competitive programming remains underexplored. In this study, we benchmark five open SLMs - LLAMA 3.2 3B, GEMMA 2 9B, GEMMA 3 12B, DEEPSEEK-R1 14B, and PHI-4 14B - across 280 Codeforces problems spanning Elo ratings from 800 to 2100 and covering 36 distinct topics. All models were tasked with generating Python solutions. PHI-4 14B achieved the best performance among SLMs, with a pass@3 of 63.6%, approaching the proprietary O3-MINI-HIGH (86.8%). In addition, we evaluated PHI-4 14B on C++ and found that combining outputs from both Python and C++ increases its aggregated pass@3 to 73.6%. A qualitative analysis of PHI-4 14B's incorrect outputs revealed that some failures were due to minor implementation issues - such as handling edge cases or correcting variable initialization - rather than deeper reasoning flaws.
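The abstract reports pass@3 scores. As a minimal sketch of how such a metric is commonly computed (the standard unbiased pass@k estimator; the paper does not spell out its exact formulation, so treat this as an assumption), where `n` generations are sampled per problem and `c` of them pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes.
    With n == k == 3 this reduces to 'any of the 3 attempts passed'."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: some draw must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 attempts per problem (n = k = 3), `pass_at_k(3, c, 3)` is 1.0 whenever at least one attempt is correct and 0.0 otherwise; averaging over all 280 problems yields the reported pass@3.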
Problem

Research questions and friction points this paper is trying to address.

Evaluating Small Language Models for code generation efficiency
Assessing SLMs' performance on competitive programming tasks
Identifying failure patterns in SLM-generated code solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking five open Small Language Models (SLMs)
Evaluating SLMs on 280 Codeforces problems
Combining Python and C++ outputs for better performance
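The cross-lingual fusion above amounts to counting a problem as solved if any attempt in either language passes. A minimal sketch of that aggregation (function and argument names are illustrative, not from the paper):

```python
def fused_pass_rate(python_results: dict, cpp_results: dict) -> float:
    """Fuse per-problem verdicts across languages.

    Each argument maps a problem id to a list of booleans, one per
    generated attempt. A problem counts as solved if any Python OR
    any C++ attempt passed its tests."""
    problems = set(python_results) | set(cpp_results)
    solved = sum(
        1 for p in problems
        if any(python_results.get(p, [])) or any(cpp_results.get(p, []))
    )
    return solved / len(problems)
```

Under this scheme, C++ attempts can rescue problems where all Python attempts failed (and vice versa), which is consistent with the reported jump from 63.6% to 73.6% aggregated pass@3 for PHI-4 14B.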
Débora Souza
Federal University of Campina Grande (UFCG), Brazil
Rohit Gheyi
Federal University of Campina Grande (UFCG), Brazil
Lucas Albuquerque
Federal University of Campina Grande (UFCG), Brazil
Gustavo Soares
Researcher, Microsoft
Márcio Ribeiro
Federal University of Alagoas (UFAL), Brazil