Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluation benchmarks for foundation models often suffer from insufficient fine-grained coverage and a lack of metadata, which limits their ability to assess model capabilities comprehensively. This work proposes an automated benchmark generation framework that combines a multi-agent system with a solution-graph-driven question-generation mechanism to synthesize high-quality, contamination-resistant evaluation items from reference materials such as textbooks, while automatically annotating fine-grained metadata. The approach substantially reduces ground-truth error rates and achieves near-uniform coverage across capability dimensions. Using this framework, the authors construct three new benchmarks in machine learning, corporate finance, and personal finance. Expert evaluations show that these benchmarks have significantly lower ground-truth error rates than MMLU and GSM8K and uncover performance differences among models that existing benchmarks fail to detect.

📝 Abstract

Evaluation of foundation models often relies on aggregate scores from benchmarks that lack the comprehensive coverage and metadata needed for fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground-truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.
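To illustrate why per-item metadata matters for fine-grained evaluation, the following is a minimal sketch, not the authors' implementation. It assumes each evaluation result is a hypothetical `(tag, is_correct)` pair, where `tag` is a competency label attached to the problem; the tag names and the two example models are invented for illustration. It shows how two models with the same aggregate accuracy can differ sharply once results are broken down by tag.

```python
from collections import defaultdict


def aggregate_accuracy(results):
    """Overall fraction of correct answers, ignoring metadata."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


def accuracy_by_tag(results):
    """Fraction of correct answers per competency tag.

    `results` is an iterable of (tag, is_correct) pairs; this record
    shape is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tag, ok in results:
        totals[tag] += 1
        correct[tag] += int(ok)
    return {tag: correct[tag] / totals[tag] for tag in totals}


# Hypothetical results for two models on the same four problems.
model_a = [
    ("regression", True),
    ("regression", True),
    ("clustering", False),
    ("clustering", False),
]
model_b = [
    ("regression", True),
    ("regression", False),
    ("clustering", True),
    ("clustering", False),
]

for name, results in [("model_a", model_a), ("model_b", model_b)]:
    print(name, aggregate_accuracy(results), accuracy_by_tag(results))
```

Both hypothetical models score 0.5 in aggregate, yet the per-tag breakdown separates them: one is strong on a single competency and fails the other, while the second is uniformly mediocre. This is the kind of difference an aggregate score alone cannot surface.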

Problem

Research questions and friction points this paper is trying to address.

foundation models

benchmark generation

fine-grained evaluation

comprehensive coverage

ground truth reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

automated benchmark generation

multi-agent architecture

solution-graph-driven