AI Summary
Existing static benchmarks are vulnerable to data contamination and overfitting, limiting their ability to faithfully evaluate large language models on knowledge-intensive reasoning tasks, while dynamic benchmarks often compromise answerability and controllability. This work proposes StressEval, a novel "failure-driven" dynamic evaluation framework that constructs semi-structured difficulty cards based on model failure analysis and employs a dual-perspective instance synthesis method to generate test samples targeting knowledge gaps and reasoning breakdowns. A gating mechanism ensures the groundedness and unambiguousness of the generated samples. The resulting Dynamic OneEval benchmark substantially reduces model performance across mainstream systems while explicitly preserving controllable difficulty factors, thereby effectively exposing model weaknesses and providing actionable feedback for iterative improvement.
Abstract
Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In this paper, we propose StressEval, a failure-driven data synthesis framework that turns observed model failures into dynamic, challenging, and controllable test instances. StressEval consists of three stages: first, it constructs a semi-structured difficulty card that identifies the failed reasoning step and its root cause; second, it applies a dual-perspective instance synthesis method that targets both knowledge gaps and reasoning breakdowns while preserving the underlying difficulty factors; and third, it applies a gating mechanism to retain only grounded, unambiguous instances. Seeding from multiple knowledge-intensive reasoning datasets, we employ StressEval to build Dynamic OneEval, a focused suite of challenging dynamic benchmarks. Across several state-of-the-art LLMs, Dynamic OneEval yields substantially larger performance drops than the original benchmarks while retaining explicit difficulty factors, enabling more actionable iteration.
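The three stages described in the abstract can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: the names `DifficultyCard`, `Candidate`, `stress_eval`, `is_grounded`, and `is_unambiguous`, the card fields, and the toy string-matching gates are all hypothetical. In practice an LLM would presumably handle both synthesis and gating.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class DifficultyCard:
    """Stage 1 (assumed fields): a semi-structured record of one failure."""
    seed_question: str
    failed_step: str   # the reasoning step where the model went wrong
    root_cause: str    # e.g. "knowledge_gap" or "reasoning_breakdown"


@dataclass(frozen=True)
class Candidate:
    """A synthesized test instance that keeps a link to its difficulty card."""
    question: str
    answer: str
    evidence: str      # text the answer is supposed to be grounded in
    perspective: str   # "knowledge" or "reasoning"
    card: DifficultyCard


Generator = Callable[[DifficultyCard], Candidate]
Gate = Callable[[Candidate], bool]


def stress_eval(cards: Iterable[DifficultyCard],
                generators: List[Generator],
                gates: List[Gate]) -> List[Candidate]:
    """Stage 2: synthesize one candidate per perspective.
    Stage 3: keep only candidates that pass every gate."""
    kept = []
    for card in cards:
        for generate in generators:
            candidate = generate(card)
            if all(gate(candidate) for gate in gates):
                kept.append(candidate)
    return kept


# Toy gates (assumptions): placeholders for real groundedness/ambiguity checks.
def is_grounded(c: Candidate) -> bool:
    return c.answer.lower() in c.evidence.lower()


def is_unambiguous(c: Candidate) -> bool:
    return c.evidence.lower().count(c.answer.lower()) == 1


# Toy generators (assumptions): fixed outputs standing in for LLM synthesis.
def knowledge_variant(card: DifficultyCard) -> Candidate:
    return Candidate("Which river flows through Vienna?", "Danube",
                     "Vienna lies on the Danube.", "knowledge", card)


def reasoning_variant(card: DifficultyCard) -> Candidate:
    # The evidence does not support this answer, so the gate should reject it.
    return Candidate("Which sea does the river through Austria's capital reach?",
                     "Black Sea", "Vienna lies on the Danube.", "reasoning", card)


card = DifficultyCard(
    seed_question="Which river flows through the capital of Austria?",
    failed_step="identify the capital",
    root_cause="knowledge_gap",
)
kept = stress_eval([card], [knowledge_variant, reasoning_variant],
                   [is_grounded, is_unambiguous])
```

Because each surviving candidate carries its originating card, the difficulty factors stay explicit after gating, which is the property the abstract highlights as enabling controllable, actionable evaluation.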