🤖 AI Summary
Existing domain-specific LLM benchmarks overemphasize dataset scale while neglecting the impact of corpus curation and question-set design on precision and recall. Method: We propose Comp-Comp, a framework that jointly optimizes corpus selection and question generation via a comprehensiveness-compactness trade-off, leveraging semantic coverage measurement and information density analysis to iteratively construct XUBench, a high-quality closed-domain benchmark tailored to university academic scenarios. Contribution/Results: Comp-Comp departs from the scale-driven paradigm, enabling compact yet highly representative benchmark design with strong coverage, precision, and evaluation efficiency. XUBench offers an extensible template for benchmark development in other domains; empirical results show that Comp-Comp outperforms conventional scale-expansion approaches in both effectiveness and resource efficiency.
📝 Abstract
Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily follow the scaling law, relying on massive corpora for supervised fine-tuning or on extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle: comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, and together they guide both corpus and QA set construction. To validate the framework, we conducted a case study at a renowned university, resulting in XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case study in this work, Comp-Comp is designed to be extensible beyond academia, offering insights for benchmark construction across various domains.
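To make the comprehensiveness-compactness trade-off concrete, below is a minimal illustrative sketch of a greedy corpus-selection loop over document embeddings: each step adds the document that most improves semantic coverage of the domain (a recall proxy) while penalizing redundancy with already-selected documents (a precision proxy). The `greedy_comp_comp` name, the coverage and redundancy scores, and the `lam` weighting are our own assumptions for illustration; the paper's actual algorithm is not specified in the abstract.

```python
import numpy as np

def greedy_comp_comp(doc_embeddings: np.ndarray, budget: int, lam: float = 0.5):
    """Illustrative greedy selection balancing comprehensiveness (semantic
    coverage of the domain) against compactness (low redundancy).
    Hypothetical sketch, not the paper's method."""
    # Normalize embeddings so dot products are cosine similarities.
    X = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    sim = X @ X.T                       # pairwise cosine similarity
    n = X.shape[0]
    selected: list[int] = []
    coverage = np.zeros(n)              # best similarity of each doc to the selection
    for _ in range(min(budget, n)):
        best_score, best_i = -np.inf, -1
        for i in range(n):
            if i in selected:
                continue
            # Comprehensiveness gain: how much adding doc i improves
            # coverage of the whole corpus (facility-location objective).
            gain = np.maximum(coverage, sim[i]).sum() - coverage.sum()
            # Compactness penalty: redundancy with already-selected docs.
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            score = gain - lam * redundancy
            if score > best_score:
                best_score, best_i = score, i
        selected.append(best_i)
        coverage = np.maximum(coverage, sim[best_i])
    return selected
```

The same loop could iterate over candidate QA items instead of documents, which is one way the principle could guide both corpus and QA set construction; `lam` then tunes where the benchmark sits between broad coverage and compactness.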