S*: Test Time Scaling for Code Generation

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the underexplored problem of scaling computational resources at test time for code generation. The authors propose S*, the first hybrid test-time scaling framework tailored for code generation, integrating both parallel and sequential scaling strategies. S* introduces execution-feedback-driven adaptive input generation, pairwise comparison-based solution selection, and dynamic solution filtering. The key contributions are: (1) establishing the first hybrid test-time scaling paradigm for code generation; (2) designing an execution-verified dynamic ranking and filtering mechanism; and (3) demonstrating on LiveCodeBench that non-reasoning models with S* (e.g., a 3B-parameter model) surpass GPT-4o-mini, that GPT-4o-mini+S* improves over o1-preview by 3.7%, and that DeepSeek-R1-Distill-Qwen-32B+S* achieves 85.7%, approaching o1 (high)'s 88.5%.

📝 Abstract
Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information, to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models - GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models - DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available at https://github.com/NovaSky-AI/SkyThought.
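The abstract's core loop can be summarized as: sample several candidates in parallel, then revise each one sequentially using execution feedback from public tests. The sketch below illustrates that control flow only; `toy_model` and `run_tests` are illustrative stand-ins (the paper uses an actual LLM and sandboxed execution), so the helper names and behavior here are assumptions, not the authors' implementation.

```python
# Minimal sketch of hybrid test-time scaling: parallel sampling plus
# sequential, execution-feedback-driven revision. `toy_model` is a toy
# stand-in for an LLM: its first draft is deliberately buggy and its
# revision (given any feedback) is correct.

def run_tests(code, tests):
    """Execute a candidate solution string against public test cases."""
    ns = {}
    try:
        exec(code, ns)
    except Exception as e:
        return False, f"error: {e}"
    for inp, expected in tests:
        if ns["solve"](*inp) != expected:
            return False, f"solve{inp} != {expected}"
    return True, "ok"

def toy_model(feedback=None):
    """Stand-in sampler: buggy first draft, fixed revision."""
    if feedback is None:
        return "def solve(a, b):\n    return a - b\n"   # buggy draft
    return "def solve(a, b):\n    return a + b\n"       # revised draft

def hybrid_scale(model, tests, n_parallel=4, n_rounds=2):
    # Parallel scaling: sample several independent candidates.
    candidates = [model() for _ in range(n_parallel)]
    # Sequential scaling: revise each candidate using execution feedback.
    refined = []
    for code in candidates:
        for _ in range(n_rounds):
            ok, log = run_tests(code, tests)
            if ok:
                break
            code = model(feedback=log)
        refined.append(code)
    # Dynamic filtering: keep only candidates that pass the public tests.
    return [c for c in refined if run_tests(c, tests)[0]]

public_tests = [((1, 2), 3), ((0, 5), 5)]
survivors = hybrid_scale(toy_model, public_tests)
print(len(survivors))  # -> 4: every draft is repaired after one revision
```

In the real system the sequential stage is what distinguishes S* from plain best-of-N sampling: feedback from failing executions steers each revision rather than resampling blindly.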
Problem

Research questions and friction points this paper is trying to address.

Test-time compute scaling is underexplored for code generation
Parallel sampling alone gives limited coverage of correct solutions
Selecting the correct solution among candidates is unreliable
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid test-time scaling framework
Adaptive input generation mechanism
Execution-grounded solution identification
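The second and third innovations work together: when multiple candidates survive the public tests, S* adaptively generates an input on which two candidates disagree, executes both, and uses the outputs to decide the winner. The sketch below mirrors that idea under stated assumptions: `propose_input` uses random search where the paper prompts an LLM, and the `judge` is a toy ground-truth check standing in for the model's output comparison.

```python
import random

# Sketch of pairwise, execution-grounded selection: find a distinguishing
# input for two candidate programs, run both, and keep the one whose
# output the judge accepts. All helper names here are illustrative.

def behavior(code, inp):
    """Run a candidate solution string on one input; None on failure."""
    ns = {}
    exec(code, ns)
    try:
        return ns["solve"](*inp)
    except Exception:
        return None

def propose_input(code_a, code_b, trials=50):
    """Search for an input on which the two candidates disagree
    (random search stands in for the paper's LLM-generated inputs)."""
    for _ in range(trials):
        inp = (random.randint(1, 10), random.randint(3, 10))
        if behavior(code_a, inp) != behavior(code_b, inp):
            return inp
    return None

def pairwise_select(candidates, judge):
    """Single-elimination tournament driven by distinguishing inputs."""
    best = candidates[0]
    for challenger in candidates[1:]:
        inp = propose_input(best, challenger)
        if inp is None:
            continue  # behaviorally identical on sampled inputs
        out_best, out_ch = behavior(best, inp), behavior(challenger, inp)
        if judge(inp, out_ch) and not judge(inp, out_best):
            best = challenger
    return best

# Toy demo: the judge knows the ground truth is a*b (normally this
# comparison is itself an LLM call grounded in the executed outputs).
cands = ["def solve(a, b):\n    return a + b\n",
         "def solve(a, b):\n    return a * b\n"]
judge = lambda inp, out: out == inp[0] * inp[1]
winner = pairwise_select(cands, judge)
```

Grounding the comparison in actual executed outputs, rather than asking a model to rank raw source code, is what makes this selection step robust.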