CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic problem generation methods struggle to simultaneously balance difficulty control, solvability, and computational efficiency, limiting their ability to produce high-quality, competition-level problems. This work proposes CoDiQ, a novel framework that, for the first time, characterizes the trade-off between expanding the test-time reasoning token budget and the resulting problems' difficulty and solvability. By introducing a controllable test-time scaling strategy, CoDiQ enables fine-grained difficulty modulation. Built on Qwen3-8B, the CoDiQ-Generator efficiently produces a corpus of 44K high-difficulty problems (CoDiQ-Corpus). Human evaluation shows that these problems significantly surpass those in the LiveCodeBench and AIME benchmarks in difficulty while maintaining over 82% solvability. Furthermore, training a large reasoning model (LRM) on this corpus substantially enhances its problem-solving capabilities, validating the effectiveness and practical utility of the proposed framework.

📝 Abstract
Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework that enables fine-grained difficulty control via test-time scaling while ensuring question solvability. First, we identify a test-time scaling tendency (an extended reasoning-token budget boosts difficulty but reduces solvability) and the intrinsic properties that define the upper bound of a model's ability to generate valid, high-difficulty questions. We then develop CoDiQ-Generator from Qwen3-8B, which raises this upper bound, making it particularly well suited to constructing challenging questions. Building on the CoDiQ framework, we construct CoDiQ-Corpus, a set of 44K competition-grade question sequences. Human evaluations show these questions are significantly more challenging than those in LiveCodeBench and AIME while retaining over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and our implementations to support related research.
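The abstract's core observation, that a larger reasoning-token budget raises question difficulty but lowers solvability, implies a simple selection problem: pick the largest budget whose expected solvability still clears a threshold. The sketch below illustrates that idea only; the `difficulty` and `solvability` scoring functions are toy stand-ins invented for illustration, not the paper's actual models or API.

```python
# Toy illustration of the difficulty/solvability trade-off that controllable
# test-time scaling navigates. All numeric models here are assumptions.

def difficulty(token_budget: int) -> float:
    """Toy model: difficulty grows with the reasoning-token budget, capped at 1.0."""
    return min(1.0, token_budget / 32_000)

def solvability(token_budget: int) -> float:
    """Toy model: solvability decays as the budget (and thus difficulty) grows."""
    return max(0.0, 1.0 - 0.4 * (token_budget / 32_000))

def pick_budget(candidates, min_solvability=0.82):
    """Choose the largest budget (hardest questions) whose expected solvability
    still clears the threshold (82% matches the paper's human-eval figure)."""
    feasible = [b for b in candidates if solvability(b) >= min_solvability]
    return max(feasible) if feasible else min(candidates)

budgets = [2_000, 4_000, 8_000, 16_000, 32_000]
best = pick_budget(budgets)
print(best, round(difficulty(best), 3), round(solvability(best), 3))
```

Under these toy curves, 16K and 32K budgets fall below the 82% solvability floor, so the selector settles on the 8K budget; the real framework would replace the closed-form curves with measured difficulty and solvability estimates.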
Problem

Research questions and friction points this paper is trying to address.

question generation
difficulty control
large reasoning models
test-time scaling
competition-level questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time scaling
controllable difficulty
question generation
reasoning models
competition-level questions
Zhongyuan Peng
Fudan University
LLM
Caijun Xu
Fudan University
Changyi Xiao
Fudan University
Shibo Hong
Fudan University
Eli Zhang
M-A-P
Stephen Huang
M-A-P
Yixin Cao
Fudan University
Natural Language Processing, Knowledge Engineering, Multi-modal Data Processing