🤖 AI Summary
In competitive programming, LLM-generated code often passes unit tests while violating time or memory constraints. To address this, we propose a multi-agent framework tailored to programming contests that integrates algorithmic planning, code generation, empirical performance analysis, and complexity-guided repair to jointly optimize correctness and resource efficiency. Our approach introduces a complexity analysis that combines static plan pruning with an LLM-based fallback, an empirical feedback mechanism based on log-log fitting and R² evaluation, and safe execution of C++17 binaries within a POSIX sandbox at fixed input scales. Evaluated on 26 contest problems, the framework achieves a first-attempt success rate of 61.5%, a three-try success rate of 80.8%, and an average solving time of 12.4 seconds. Compared to Claude Opus 4, it raises run-level success from 52.6% to 73.1% and reports fine-grained efficiency metrics, including eff@k and TLE/MLE incidence rates.
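To make the complexity-fitting step concrete, here is a minimal C++17 sketch of the kind of log-log fit and R² check described above: regress log(time) on log(n) over profiled samples, read the slope as the growth exponent, and defer to a fallback when the fit is noisy. The profile values, the 0.95 threshold, and all identifiers are illustrative assumptions, not SwiftSolve's actual code.

```cpp
// Illustrative sketch (not SwiftSolve's implementation): fit log(t) = s*log(n) + b
// by least squares over profiled (input size, wall time) samples, report the
// slope s and coefficient of determination R^2, and flag poor fits for fallback.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

struct Fit { double slope, intercept, r2; };

Fit fit_loglog(const std::vector<std::pair<double, double>>& samples) {
    const double k = static_cast<double>(samples.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (auto [n, t] : samples) {
        double x = std::log(n), y = std::log(t);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double slope = (k * sxy - sx * sy) / (k * sxx - sx * sx);
    double intercept = (sy - slope * sx) / k;
    double ss_res = 0, ss_tot = 0, mean_y = sy / k;
    for (auto [n, t] : samples) {
        double x = std::log(n), y = std::log(t);
        double resid = y - (slope * x + intercept);
        ss_res += resid * resid;
        ss_tot += (y - mean_y) * (y - mean_y);
    }
    return {slope, intercept, 1.0 - ss_res / ss_tot};
}

int main() {
    // Hypothetical profile of a quadratic solution on a fixed input-size schedule.
    std::vector<std::pair<double, double>> profile =
        {{1e3, 0.004}, {1e4, 0.41}, {2e4, 1.63}, {4e4, 6.5}};
    Fit f = fit_loglog(profile);
    if (f.r2 < 0.95)                        // confidence threshold is an assumption
        std::puts("noisy fit: defer to LLM fallback");
    else
        std::printf("t ~ n^%.2f (R^2 = %.3f)\n", f.slope, f.r2);
}
```

A slope near 2 with R² close to 1 reads as quadratic growth, which against a tight time budget is precisely the signal that would route a patch request back to the planning stage rather than the coding stage.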
📝 Abstract
Correctness alone is insufficient: LLM-generated programs frequently satisfy unit tests while violating contest time or memory budgets. We present SwiftSolve, a complexity-aware multi-agent system for competitive programming that couples algorithmic planning with empirical profiling and complexity-guided repair. We frame competitive programming as a software environment in which specialized agents act as programmers, each assuming a role such as planning, coding, profiling, or complexity analysis. A Planner proposes an algorithmic sketch; a deterministic Static Pruner filters high-risk plans; a Coder emits ISO C++17; a Profiler compiles and executes candidates on a fixed input-size schedule to record wall time and peak memory; and a Complexity Analyst fits log-log growth (slope s, R²), with an LLM fallback, to assign a complexity class and dispatch targeted patches to either the Planner or the Coder. Agents communicate via typed, versioned JSON; a controller enforces iteration caps and diminishing-returns stopping. Evaluated on 26 problems (16 BigO, 10 Codeforces Div. 2) in a POSIX sandbox (2 s / 256–512 MB), SwiftSolve attains pass@1 = 61.54% (16/26) and Solved@≤3 = 80.77% with a marginal latency change (mean 11.96 s to 12.66 s per attempt). Aggregate run-level success is 73.08% at a 12.40 s mean. Failures are predominantly resource-bound, indicating inefficiency rather than logic errors. Against Claude Opus 4, SwiftSolve improves run-level success (73.1% vs. 52.6%) at approximately 2× runtime overhead (12.4 s vs. 6.8 s). Beyond correctness (pass@k), we report efficiency metrics (eff@k for runtime and memory, incidence of TLE or MLE, and complexity-fit accuracy on BigO), demonstrating that profiling and complexity-guided replanning reduce inefficiency while preserving accuracy.
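For readers unfamiliar with POSIX resource limits, the sandbox numbers above (2 s, 256–512 MB) map directly onto rlimits. The following is a minimal sketch of such a runner, assuming a Linux host; the verdict strings, the use of RLIMIT_AS rather than cgroups, and the CPU-time limit standing in for the wall-clock budget are all illustrative assumptions, not the paper's harness.

```cpp
// Illustrative sketch (assumptions, not SwiftSolve's sandbox): fork a child,
// cap its CPU time and address space with setrlimit, exec the compiled binary,
// and classify resource-bound failures from the exit status.
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s ./solution\n", argv[0]); return 2; }

    pid_t pid = fork();
    if (pid == 0) {                                  // child: apply limits, then exec
        rlimit cpu{2, 3};                            // soft 2 s CPU (SIGXCPU), hard 3 s
        rlimit mem{256ul << 20, 256ul << 20};        // 256 MB address-space cap
        setrlimit(RLIMIT_CPU, &cpu);
        setrlimit(RLIMIT_AS, &mem);
        execv(argv[1], argv + 1);
        _exit(127);                                  // exec failed
    }

    int status = 0;
    waitpid(pid, &status, 0);
    rusage usage{};
    getrusage(RUSAGE_CHILDREN, &usage);              // peak RSS of the finished child

    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        if (sig == SIGXCPU || sig == SIGKILL)
            std::puts("verdict: TLE (CPU budget exhausted)");
        else
            std::printf("verdict: RE / possible MLE (killed by signal %d)\n", sig);
    } else if (WEXITSTATUS(status) != 0) {
        std::puts("verdict: RE / possible MLE (nonzero exit; allocation may have failed)");
    } else {
        std::printf("verdict: OK, peak RSS ~ %ld KB\n", usage.ru_maxrss);  // KB on Linux
    }
}
```

After compiling a candidate with g++ -std=c++17 -O2, the runner would be invoked as, e.g., ./runner ./solution < input; the child inherits the redirected stdin. A production harness would add a wall-clock watchdog alongside the CPU limit.

On the metrics side, pass@k is conventionally computed with the unbiased estimator over n sampled solutions per problem, of which c pass:

$$
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
$$

The abstract does not spell out eff@k; a natural reading, flagged here as an assumption, is the same estimator with "passes" tightened to "passes within the time and memory budget".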