FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models remain weak at open-ended programming, largely because large-scale, high-quality open-ended problems (those with no known optimal solution) are scarce and expensive to construct. This work proposes FrontierSmith, a framework that automatically evolves open-ended programming problems from closed-ended ones (e.g., competitive programming tasks) by iteratively changing their goals, restricting outputs, and generalizing inputs. It uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers, and employs agents to generate test cases and verifiers, forming an end-to-end automated data synthesis pipeline. Training on the synthesized data improves Qwen3.5-9B by +8.82 points on FrontierCS and +306.36 in Elo-rating-based performance on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also lead agents to take more turns and use more tokens, much as human-curated problems do.

📝 Abstract

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 score on FrontierCS and +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.
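
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the idea divergence metric, so everything here is an assumption made for illustration: each solver's approach is reduced to a set of idea tags, divergence is taken to be the mean pairwise Jaccard distance between those sets, and a candidate survives when its divergence reaches a chosen threshold. The function names, tags, and threshold are hypothetical and are not FrontierSmith's actual method.

```python
from itertools import combinations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def idea_divergence(solver_ideas: list[set[str]]) -> float:
    """Stand-in metric (assumption): mean pairwise Jaccard distance
    between the idea-tag sets produced by different solvers."""
    pairs = list(combinations(solver_ideas, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def select_candidates(
    candidates: dict[str, list[set[str]]], threshold: float
) -> list[str]:
    """Keep candidate problems whose divergence meets the threshold."""
    return [
        name
        for name, ideas in candidates.items()
        if idea_divergence(ideas) >= threshold
    ]


# Hypothetical candidates: every solver converges on one idea for the
# closed-ended variant, while the open-ended variant draws varied ideas.
candidates = {
    "closed_variant": [{"dp"}, {"dp"}, {"dp"}],
    "open_variant": [
        {"greedy"},
        {"simulated_annealing"},
        {"beam_search", "greedy"},
    ],
}

survivors = select_candidates(candidates, threshold=0.5)
print(survivors)
```

Under these assumptions, a candidate on which all solvers agree scores zero and is filtered out, while one that draws different strategies from different solvers is kept for the later test-case and verifier generation step.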

Problem

Research questions and friction points this paper is trying to address.

open-ended coding

LLM training

problem synthesis

coding benchmarks

diverse problem generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-ended coding

problem synthesis

idea divergence