🤖 AI Summary
Large language models perform poorly on competition-level programming tasks, and existing remedies often rely on extensive inference-time sampling or costly multi-stage post-training. This work proposes CP-Agent, a feedback-driven solving framework that requires no parameter updates and models solving as a calibrated stopped process. It identifies three governing quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the probability of clean success before false admission. The method targets these quantities with Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving. On LiveCodeBench Pro it raises Pass@1 from 25.8% to 48.5%, and on ICPC-Eval it improves Refine@5 by 11.0%. Across three LLM backbones, it lies on the cost-accuracy efficiency frontier.
📝 Abstract
Large language models still struggle with contest-level programming, while many agentic remedies rely on massive inference-time sampling or expensive multi-stage post-training. We study when execution feedback reliably helps an LLM-based competitive programming (CP) solver and which mechanisms govern the gains. We model feedback-driven solving as a calibrated stopped process and identify three quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, yielding CP-Agent. Without updating any parameters, CP-Agent raises Pass@1 from 25.8% to 48.5% on LiveCodeBench Pro and improves Refine@5 by 11.0% on ICPC-Eval. Across three LLM backbones, CP-Agent lies on the cost-accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.
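To make the notions of a stopped process and false admission concrete, the following is a minimal toy sketch, not CP-Agent's implementation. It assumes a solver that proposes candidate programs, runs them against a test set, and stops at the first candidate that passes. All names (`Outcome`, `run_tests`, `solve_with_feedback`, `make_toy_proposer`) are hypothetical, the "LLM" is replaced by a fixed list of candidates, and no calibration or certificate is computed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Program = Callable[[int], int]
Test = Tuple[int, int]  # (input, expected output)


@dataclass
class Outcome:
    admitted: Optional[Program]  # program accepted by the tests, if any
    rounds_used: int             # round at which the process stopped


def run_tests(program: Program, tests: List[Test]) -> List[Test]:
    """Return the tests that the program fails (crashes count as failures)."""
    failures = []
    for x, expected in tests:
        try:
            if program(x) != expected:
                failures.append((x, expected))
        except Exception:
            failures.append((x, expected))
    return failures


def solve_with_feedback(
    propose: Callable[[List[Test]], Program],
    tests: List[Test],
    max_rounds: int,
) -> Outcome:
    """Propose, execute, and stop at the first candidate passing all tests."""
    feedback: List[Test] = []
    for round_index in range(1, max_rounds + 1):
        candidate = propose(feedback)
        feedback = run_tests(candidate, tests)
        if not feedback:  # stopping rule: admit the candidate
            return Outcome(candidate, round_index)
    return Outcome(None, max_rounds)


def make_toy_proposer() -> Callable[[List[Test]], Program]:
    """Hypothetical stand-in for an LLM: a wrong guess, then the right one."""
    candidates = iter([lambda x: x * 2, lambda x: x * x])

    def propose(feedback: List[Test]) -> Program:
        return next(candidates)

    return propose


# Target task (assumed for illustration): square the input.
weak = solve_with_feedback(make_toy_proposer(), [(2, 4)], max_rounds=5)
strong = solve_with_feedback(make_toy_proposer(), [(2, 4), (3, 9)], max_rounds=5)

print(weak.rounds_used, weak.admitted(3))
print(strong.rounds_used, strong.admitted(3))
```

With the single test `(2, 4)`, the doubling program is admitted in the first round even though it is wrong on other inputs, which is a false admission. Adding the test `(3, 9)` rejects it and the process stops one round later on the correct program. This only illustrates why stronger tests lower false-admission risk in such a loop; how CP-Agent generates tests or calibrates the stopping rule is not specified here.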