Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant limitations in solving complex algorithmic problems from the International Collegiate Programming Contest (ICPC), particularly due to weak algorithmic reasoning and low code-generation accuracy. Method: We propose a multi-round self-feedback and retrieval-augmented reasoning framework that integrates zero-shot chain-of-thought prompting, code-snippet-based retrieval augmentation, iterative self-evaluation, and high-quality unit-test-driven assessment. Contribution/Results: We construct a rigorous benchmark of 254 authentic ICPC problems and conduct a systematic evaluation. Our framework improves the pass@1 rate of the o1 model from 19.1% to 42.2% (+23.1 percentage points). Moreover, under human–AI collaboration, it solves 17 of 18 problems that no prior model or technique had solved, delineating for the first time the capability frontier of LLMs in high-difficulty competitive programming and revealing principled mechanisms of collaborative gain.
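The loop described above (generate with retrieved context, run unit tests, feed failures back, repeat) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the keyword-overlap retriever, and the stubbed "model" that fixes its bug after one round of feedback are all hypothetical stand-ins for real LLM calls.

```python
def retrieve_snippets(problem, corpus):
    """Toy retrieval: return corpus entries sharing a word with the problem.
    (A real system would use embedding or code-snippet similarity.)"""
    words = set(problem.lower().split())
    return [s for s in corpus if words & set(s.lower().split())]

def generate_solution(problem, snippets, feedback):
    """Stand-in for an LLM call with zero-shot CoT prompting plus
    retrieved snippets. Here we fake it: the first attempt is buggy,
    and any non-empty feedback yields a repaired program."""
    if feedback:
        return lambda x: x * 2        # "repaired" attempt
    return lambda x: x + x - 1        # deliberately buggy first attempt

def run_unit_tests(program, tests):
    """Unit-test-driven assessment: return (passed, feedback string)."""
    for inp, expected in tests:
        got = program(inp)
        if got != expected:
            return False, f"input {inp}: expected {expected}, got {got}"
    return True, ""

def solve(problem, corpus, tests, max_rounds=3):
    """Multi-round self-feedback loop: retrieve, generate, test, reflect."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        snippets = retrieve_snippets(problem, corpus)
        program = generate_solution(problem, snippets, feedback)
        ok, feedback = run_unit_tests(program, tests)
        if ok:
            return program, round_no
    return None, max_rounds
```

In this toy run, the buggy first attempt fails the sample tests, the failure message becomes the feedback for round two, and the second attempt passes, mirroring the framework's iterative self-evaluation at a much smaller scale.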

📝 Abstract
Competitive programming is among the hardest tasks for humans: its problems demand sophisticated algorithmic thinking, puzzle solving, and the construction of effective code. As a domain for assessing language models (LMs), however, it has received little attention. This study presents the ICPC benchmark, which consists of 254 International Collegiate Programming Contest (ICPC) problems. Each problem includes an official analysis, reference code, and sample, high-quality unit, and hidden tests. These resources let us develop and evaluate a variety of LM inference techniques for competitive programming. With zero-shot chain-of-thought prompting, we find that o1 achieves only a 19.1% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judging with reflection and retrieval over episodic information, raises this to 42.2%. Furthermore, we conduct a new human-in-the-loop investigation to better understand the remaining difficulties. Surprisingly, we discover that with just a few specific instructions, o1 can solve 17 of 18 problems that no model or technique had previously solved. Our quantitative findings and qualitative analysis are a step toward LMs with grounded, imaginative, and algorithmic thinking. We open-source our code and data at https://github.com/kraritt/zolve.
Problem

Research questions and friction points this paper is trying to address.

Solving competitive programming problems with algorithmic thinking
Improving language model performance on hard coding tasks
Developing multi-turn self-refined retrieval techniques for LMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn self-judge with reflection
Retrieval over episodic information
Human-in-the-loop investigation