UniCode: A Framework for Generating High Quality Competitive Coding Problems

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing competitive programming benchmarks rely on manual problem authoring and suffer from data contamination and poor scalability. To address these limitations, this paper proposes UniCode, a framework that integrates large language models (LLMs) with biologically inspired evolutionary mechanisms to enable fully automated, high-quality algorithmic problem generation. UniCode diversifies problems through single-problem extension, same-type fusion, and cross-type fusion. It introduces a stress-driven test-case generation pipeline, requiring no reference solutions, that combines small-scale brute-force grounding with large-scale consensus-based validation to ensure robustness and comprehensive coverage. The authors construct a contamination-resistant evaluation benchmark of 492 problems. Evaluating 19 mainstream LLMs shows that even the strongest model, o4-mini, achieves only a 70.3% pass rate, demonstrating UniCode's high difficulty and strong discriminative power.

📝 Abstract
The reliance of competitive coding benchmarks on static, human-authored problems creates significant challenges, including data contamination and limited scalability. To address these issues, we introduce UniCode, a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases. Inspired by biological evolution, which creates better and more diverse offspring, our framework leverages Large Language Models (LLMs) to systematically diversify problems through three strategies: single problem extension, same-type fusion, and cross-type fusion. A key innovation is our stress-driven test case synthesis pipeline, which generates reliable test suites without requiring a canonical ground-truth solution. This pipeline combines brute-force grounding for small-scale inputs with a consensus-based validation mechanism for large-scale inputs to ensure high correctness and coverage. We demonstrate the effectiveness of our framework by curating a benchmark of 492 problems and evaluating 19 state-of-the-art LLMs. The results reveal that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%. Our framework provides a scalable and reliable solution for generating dynamic evaluation datasets in the coding domain.
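The two-tier labeling idea in the abstract (brute-force grounding for small inputs, consensus among independent candidate solutions for large inputs) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the maximum-subarray-sum task, the `brute_force`/`kadane`/`buggy` solvers, and the size and quorum thresholds are all hypothetical stand-ins.

```python
from collections import Counter

def brute_force(nums):
    # Exhaustive O(n^2) reference solver: slow but trusted,
    # standing in for a problem's brute-force grounding oracle.
    return max(sum(nums[i:j])
               for i in range(len(nums))
               for j in range(i + 1, len(nums) + 1))

def kadane(nums):
    # A correct fast candidate solution.
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def buggy(nums):
    # A deliberately flawed candidate: mishandles all-negative arrays.
    return max(max(nums), sum(n for n in nums if n > 0))

def label_test_case(nums, candidates, small_threshold=8, quorum=2 / 3):
    """Assign an expected output to an input: brute force for small
    inputs, majority consensus among candidates for large ones."""
    if len(nums) <= small_threshold:
        return brute_force(nums)                  # grounding tier
    votes = Counter(c(nums) for c in candidates)  # consensus tier
    answer, count = votes.most_common(1)[0]
    if count / len(candidates) >= quorum:
        return answer
    return None  # no consensus: discard this test case

candidates = [kadane, kadane, buggy]
print(label_test_case([2, -1, 3], candidates))                            # → 4
print(label_test_case([5, -2, 4, -1, 2, -7, 3, -2, 6, -1], candidates))   # → 8
```

On the large input, the buggy candidate is outvoted two to one, so the consensus answer matches the correct solvers; inputs where no quorum forms are simply dropped rather than mislabeled.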
Problem

Research questions and friction points this paper is trying to address.

Automatically generates algorithmic problems to overcome static benchmark limitations
Creates contamination-resistant test cases without requiring ground-truth solutions
Provides scalable dynamic evaluation datasets for competitive coding assessments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically generates algorithmic problems using LLMs
Diversifies problems through evolutionary fusion strategies
Synthesizes test cases without ground-truth solutions
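The three diversification strategies listed above can be pictured with a small prompt-construction sketch. Everything here is a hypothetical illustration: the category labels, seed problems, and prompt wording are invented for the example and are not taken from the paper.

```python
import random

# Toy seed pool keyed by problem category (illustrative only).
SEEDS = {
    "graph": ["shortest path with tolls", "count connected components"],
    "dp": ["longest increasing subsequence", "coin change variants"],
}

def diversify(strategy, rng=None):
    """Build a generation prompt for one of the three strategies:
    single-problem extension, same-type fusion, or cross-type fusion."""
    rng = rng or random.Random(0)
    if strategy == "single_extension":
        cat = rng.choice(list(SEEDS))
        return f"Extend this problem with a new constraint: {rng.choice(SEEDS[cat])}"
    if strategy == "same_type_fusion":
        cat = rng.choice(list(SEEDS))
        a, b = rng.sample(SEEDS[cat], 2)
        return f"Fuse two {cat} problems into one: {a} + {b}"
    if strategy == "cross_type_fusion":
        ca, cb = rng.sample(list(SEEDS), 2)
        return f"Fuse a {ca} problem with a {cb} problem: {rng.choice(SEEDS[ca])} + {rng.choice(SEEDS[cb])}"
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("single_extension", "same_type_fusion", "cross_type_fusion"):
    print(diversify(s))
```

The resulting prompts would then be handed to an LLM to draft the fused problem statement, which is the step the framework automates at scale.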
Xinyue Zheng (Institute for Artificial Intelligence, Peking University)
Haowei Lin (Peking University)
Shaofei Cai (Institute for Artificial Intelligence, Peking University)
Zilong Zheng (BIGAI)
Yitao Liang (Peking University)

Machine Learning, AI Reasoning, AI Agent