Code Generation with Small Language Models: A Deep Evaluation on Codeforces

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of small language models (SLMs) for competitive programming code generation. We present the first comprehensive benchmark of five open-source SLMs on 280 Codeforces problems (Elo 800–2100), spanning 36 algorithmic topics, with dual-language (Python/C++) solution assessment. Methodologically, we propose a fine-grained behavioral analysis framework incorporating Elo-stratified sampling, topic-diversity quantification, qualitative error attribution, and cross-lingual output fusion. Key findings reveal that SLM failures stem primarily from implementation-level inaccuracies—not fundamental reasoning deficits. PHI-4 14B achieves 63.6% pass@3 in Python; integrating its C++ outputs raises performance to 73.6%, approaching the commercial O3-MINI-HIGH (86.8%). This demonstrates SLMs’ practical viability for efficient, low-risk programming tasks—particularly where correctness, latency, and resource efficiency are critical.

📝 Abstract
Large Language Models (LLMs) have demonstrated capabilities in code generation, potentially boosting developer productivity. However, their widespread adoption remains limited by high computational costs, significant energy demands, and security risks such as data leakage and adversarial attacks. As a lighter-weight alternative, Small Language Models (SLMs) offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks, making them an attractive option for real-world applications. While prior research has benchmarked LLMs on competitive programming tasks, such evaluations often focus narrowly on metrics like Elo scores or pass rates, overlooking deeper insights into model behavior, failure patterns, and problem diversity. Furthermore, the potential of SLMs to tackle complex tasks such as competitive programming remains underexplored. In this study, we benchmark five open SLMs - LLAMA 3.2 3B, GEMMA 2 9B, GEMMA 3 12B, DEEPSEEK-R1 14B, and PHI-4 14B - across 280 Codeforces problems spanning Elo ratings from 800 to 2100 and covering 36 distinct topics. All models were tasked with generating Python solutions. PHI-4 14B achieved the best performance among SLMs, with a pass@3 of 63.6%, approaching the proprietary O3-MINI-HIGH (86.8%). In addition, we evaluated PHI-4 14B on C++ and found that combining outputs from both Python and C++ increases its aggregated pass@3 to 73.6%. A qualitative analysis of PHI-4 14B's incorrect outputs revealed that some failures were due to minor implementation issues - such as handling edge cases or correcting variable initialization - rather than deeper reasoning flaws.
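The abstract reports pass@3 scores. As a minimal sketch of how such a metric is commonly computed (the standard unbiased pass@k estimator; the paper does not spell out its exact formulation, so treat this as an assumption), where `n` generations are sampled per problem and `c` of them pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes.
    With n == k == 3 this reduces to 'any of the 3 attempts passed'."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k: some draw must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 attempts per problem (n = k = 3), `pass_at_k(3, c, 3)` is 1.0 whenever at least one attempt is correct and 0.0 otherwise; averaging over all 280 problems yields the reported pass@3.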
Problem

Research questions and friction points this paper is trying to address.

Evaluating Small Language Models for code generation efficiency
Assessing SLMs' performance on competitive programming tasks
Identifying failure patterns in SLM-generated code solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking five open Small Language Models (SLMs)
Evaluating SLMs on 280 Codeforces problems
Combining Python and C++ outputs for better performance
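The cross-lingual fusion above amounts to counting a problem as solved if any attempt in either language passes. A minimal sketch of that aggregation (function and argument names are illustrative, not from the paper):

```python
def fused_pass_rate(python_results: dict, cpp_results: dict) -> float:
    """Fuse per-problem verdicts across languages.

    Each argument maps a problem id to a list of booleans, one per
    generated attempt. A problem counts as solved if any Python OR
    any C++ attempt passed its tests."""
    problems = set(python_results) | set(cpp_results)
    solved = sum(
        1 for p in problems
        if any(python_results.get(p, [])) or any(cpp_results.get(p, []))
    )
    return solved / len(problems)
```

Under this scheme, C++ attempts can rescue problems where all Python attempts failed (and vice versa), which is consistent with the reported jump from 63.6% to 73.6% aggregated pass@3 for PHI-4 14B.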
Débora Souza
Federal University of Campina Grande (UFCG), Brazil
Rohit Gheyi
Federal University of Campina Grande (UFCG), Brazil
Lucas Albuquerque
Federal University of Campina Grande (UFCG), Brazil
Gustavo Soares
Researcher, Microsoft
Márcio Ribeiro
Federal University of Alagoas (UFAL), Brazil