AI Summary
Current evaluations of the scientific capabilities of large language models rely mostly on manually curated or general-purpose benchmarks, which scale poorly and miss the fine-grained, application-oriented competencies that real-world research requires. This work proposes SciCustom, a framework for fine-grained, application-specific assessment that needs neither expert annotation nor synthetic question generation. SciCustom builds custom benchmarks from large-scale scientific data: it organizes knowledge into ontology-grounded knowledge units and trains a tagger to map data instances onto them, identifies the units relevant to a given requirement through voting-based multi-model consensus, retrieves relevant benchmark data via binary search, and then selects a proxy subset and generates a data-grounded benchmark for efficient evaluation. Experiments in chemistry and healthcare show that SciCustom reveals differences in model performance that standard benchmarks overlook, providing a scalable, application-aware basis for evaluating scientific capabilities.
Abstract
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.
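To make the voting-based multi-model consensus step more concrete, here is a minimal sketch of one way such a vote could work. It is not taken from the SciCustom code base: the function name `consensus_units`, the `min_votes` threshold rule, and the example knowledge-unit names are all hypothetical, and the voting scheme the authors actually use may differ.

```python
from collections import Counter
from typing import Iterable, List, Set


def consensus_units(votes_per_model: Iterable[Set[str]], min_votes: int) -> List[str]:
    """Return knowledge units chosen by at least `min_votes` models.

    Assumption: each model casts at most one vote per knowledge unit, and a
    unit is kept when its vote count reaches a fixed threshold. Results are
    ordered by vote count (descending), then alphabetically.
    """
    counts: Counter = Counter()
    for units in votes_per_model:
        # set() guards against a model listing the same unit twice
        counts.update(set(units))
    kept = [unit for unit, n in counts.items() if n >= min_votes]
    return sorted(kept, key=lambda unit: (-counts[unit], unit))


# Hypothetical votes from three models for a chemistry-oriented requirement
votes = [
    {"organic_synthesis", "reaction_yield", "spectroscopy"},
    {"organic_synthesis", "reaction_yield"},
    {"organic_synthesis", "thermodynamics"},
]
selected = consensus_units(votes, min_votes=2)
print(selected)
```

In this sketch, raising `min_votes` trades recall for precision: a stricter threshold keeps only the units most models agree on, while a looser one admits units that a single model considered relevant.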