🤖 AI Summary
Existing benchmarks for evaluating foundation models often lack fine-grained coverage and metadata, which limits how comprehensively they can assess model capabilities. This work proposes an automated benchmark generation framework that combines a multi-agent system with a solution-graph-driven question-generation mechanism to synthesize high-quality, contamination-resistant evaluation items from reference materials such as textbooks, while automatically annotating fine-grained metadata. The approach substantially reduces ground-truth error rates and achieves near-uniform coverage across competencies. Using this framework, the authors construct three new benchmarks in machine learning, corporate finance, and personal finance. Expert review finds that these benchmarks have significantly lower ground-truth error rates than MMLU and GSM8K, and an evaluation of 12 commercial and open-source models uncovers performance differences that existing benchmarks fail to detect.
📝 Abstract
Evaluation of foundation models often relies on aggregate scores from benchmarks that lack the comprehensive coverage and metadata needed for fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground-truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than in previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.