How to Get Your LLM to Generate Challenging Problems for Evaluation

📅 2025-02-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Manually constructing high-difficulty evaluation benchmarks is costly and scales poorly. Method: This paper proposes CHASE, a verifiable framework in which LLMs generate challenging evaluation instances without human involvement. CHASE builds each hard problem bottom-up from simpler components and decomposes generation into independently verifiable sub-tasks, combining hierarchical prompt engineering, task-decoupled generation, and automated satisfiability verification to synthesize hard problems in document-based QA, repository-level code completion, and mathematical reasoning. Contribution/Results: CHASE offers a synthetic evaluation paradigm that needs no human annotation and aims for both high problem difficulty and verifiable answer correctness. On the synthesized benchmarks, state-of-the-art models reach only 40–60% accuracy, substantially below their typical performance on conventional benchmarks. The benchmarks and implementation code are publicly released.

📝 Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
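To make the bottom-up idea concrete, here is a minimal sketch of composing a harder math problem from simple steps while tracking the ground-truth answer as each step is added. It is not the CHASE implementation: the `Problem` class, the `STEPS` templates, and `build_problem` are hypothetical names chosen for illustration, and the templates are hand-written where the paper uses an LLM.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Problem:
    sentences: list = field(default_factory=list)  # problem text, one step per sentence
    answer: int = 0                                # ground truth tracked during construction


# Hypothetical step templates: each pairs a sentence with the arithmetic it encodes.
STEPS = [
    ("She then finds {n} more marbles.", lambda v, n: v + n),
    ("She gives {n} marbles to a friend.", lambda v, n: v - n),
    ("She triples her collection and then adds {n} marbles.", lambda v, n: 3 * v + n),
]


def build_problem(depth: int, seed: int = 0) -> Problem:
    """Chain `depth` simple steps into one multi-step word problem."""
    rng = random.Random(seed)
    start = rng.randint(10, 20)
    problem = Problem([f"Ava starts with {start} marbles."], start)
    for _ in range(depth):
        # Resample until the step keeps the running total positive,
        # so every intermediate state is a valid, checkable sub-problem.
        while True:
            template, apply = rng.choice(STEPS)
            n = rng.randint(1, 9)
            new_value = apply(problem.answer, n)
            if new_value > 0:
                break
        problem.sentences.append(template.format(n=n))
        problem.answer = new_value
    problem.sentences.append("How many marbles does Ava have now?")
    return problem


if __name__ == "__main__":
    p = build_problem(depth=5, seed=42)
    print(" ".join(p.sentences))
    print("Ground-truth answer:", p.answer)
```

Because the answer is updated alongside each added sentence, difficulty can be raised by increasing `depth` without ever needing a separate solver to label the final problem.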

Problem

Research questions and friction points this paper is trying to address.

Generating challenging problems for LLM evaluation

Automating problem synthesis without human involvement

Creating diverse benchmarks for rigorous LLM assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based synthetic problem generation

Bottom-up hard problem construction

Independently verifiable sub-task decomposition
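One way to picture independently verifiable sub-tasks is a pipeline in which every generation stage is paired with its own check, and a candidate is discarded as soon as any check fails. The sketch below is an assumption-laden illustration, not the paper's code: `run_pipeline`, the stage functions, and the toy document-QA data are all hypothetical, and the `gen_*` functions are hard-coded stand-ins for LLM calls.

```python
from typing import Callable, Optional

# A stage is (name, generate, verify). In a real system `generate` would call
# an LLM; here it is a hard-coded stand-in so the sketch runs offline.
Stage = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


def run_pipeline(stages: list[Stage]) -> Optional[dict]:
    """Run stages in order; return None as soon as one stage fails its check."""
    state: dict = {}
    for name, generate, verify in stages:
        state = generate(state)
        if not verify(state):
            print(f"rejected at stage: {name}")
            return None
    return state


def gen_facts(s): return {**s, "facts": {"Alice": "Paris", "Bob": "Oslo"}}
def check_facts(s): return len(s["facts"]) >= 2

def gen_question(s): return {**s, "person": "Bob", "question": "Where does Bob live?"}
def check_question(s): return s["person"] in s["facts"] and s["person"] in s["question"]

def gen_answer(s): return {**s, "answer": s["facts"][s["person"]]}
def gen_wrong_answer(s): return {**s, "answer": "Rome"}  # simulates a faulty generation
def check_answer(s): return s["answer"] == s["facts"][s["person"]]


good = [
    ("facts", gen_facts, check_facts),
    ("question", gen_question, check_question),
    ("answer", gen_answer, check_answer),
]
bad = good[:2] + [("answer", gen_wrong_answer, check_answer)]

if __name__ == "__main__":
    print(run_pipeline(good))
    print(run_pipeline(bad))
```

Keeping each check local to its stage means a failure points at one small, inspectable step instead of at the finished problem as a whole.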
