🤖 AI Summary
Existing competitive programming benchmarks rely on manual problem authoring and therefore suffer from data contamination and poor scalability. To address these limitations, this paper proposes UniCode, a novel framework that integrates large language models (LLMs) with biologically inspired evolutionary mechanisms to enable fully automated, high-quality algorithmic problem generation. UniCode diversifies problems through single-problem extension as well as same-type and cross-type fusion. It also introduces a stress-driven test-case generation pipeline, requiring no reference solution, that combines small-scale brute-force grounding with large-scale consensus-based validation to ensure robustness and comprehensive coverage. The authors construct a contamination-resistant evaluation benchmark comprising 492 problems. Evaluating 19 mainstream LLMs reveals that even the strongest model, o4-mini, achieves only a 70.3% pass rate, demonstrating UniCode's high difficulty and strong discriminative power.
📝 Abstract
The reliance of competitive coding benchmarks on static, human-authored problems creates significant challenges, including data contamination and limited scalability. To address these issues, we introduce UniCode, a novel framework that automatically generates high-quality algorithmic problems alongside robust, contamination-resistant test cases. Inspired by biological evolution, which produces diverse and better-adapted offspring, our framework leverages Large Language Models (LLMs) to systematically diversify problems through three strategies: single-problem extension, same-type fusion, and cross-type fusion. A key innovation is our stress-driven test case synthesis pipeline, which generates reliable test suites without requiring a canonical ground-truth solution. This pipeline combines brute-force grounding for small-scale inputs with a consensus-based validation mechanism for large-scale inputs to ensure high correctness and coverage. We demonstrate the effectiveness of our framework by curating a benchmark of 492 problems and evaluating 19 state-of-the-art LLMs. The results reveal that UniCode is highly challenging and discriminative, with the top-performing model, o4-mini, achieving a pass rate of only 70.3%. Our framework provides a scalable and reliable solution for generating dynamic evaluation datasets in the coding domain.
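The two-tier test synthesis described above (brute-force grounding for small inputs, majority consensus for large ones) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy task (maximum subarray sum), the `brute_force` solver, and the `candidates` list are hypothetical stand-ins for an LLM-generated reference brute force and a pool of independently generated candidate solutions.

```python
import random
from collections import Counter

def brute_force(xs):
    # Slow but obviously correct O(n^2) solver: maximum subarray sum.
    # Stands in for the small-scale "grounding" solution.
    best = xs[0]
    for i in range(len(xs)):
        s = 0
        for j in range(i, len(xs)):
            s += xs[j]
            best = max(best, s)
    return best

def kadane(xs):
    # One efficient "candidate" solution (Kadane's algorithm).
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

# A pool of candidate solutions; the last one is deliberately buggy,
# so consensus voting must outvote it on large inputs.
candidates = [kadane, kadane, lambda xs: max(xs)]

def make_tests(n_small=50, n_large=20, seed=0):
    rng = random.Random(seed)
    suite = []
    # Tier 1: small-scale inputs, answers grounded by brute force.
    for _ in range(n_small):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        suite.append((xs, brute_force(xs)))
    # Tier 2: large-scale inputs, kept only when a strict majority of
    # candidates agree on the output (consensus-based validation).
    for _ in range(n_large):
        xs = [rng.randint(-10, 10) for _ in range(1000)]
        votes = Counter(f(xs) for f in candidates)
        answer, count = votes.most_common(1)[0]
        if count > len(candidates) // 2:
            suite.append((xs, answer))
    return suite

suite = make_tests()
```

The design point is that neither tier needs a trusted canonical solution: small cases are cheap enough to brute-force exhaustively, while large cases are accepted only when independent candidates converge, which filters out cases where any single buggy solution would poison the expected output.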