SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of scientific capabilities in large language models rely mostly on manually curated or general-purpose benchmarks, which scale poorly and struggle to capture the fine-grained, application-oriented competencies required in real-world research. This work proposes SciCustom, a framework for fine-grained, domain-specific assessment that requires neither expert annotation nor synthetic question generation. SciCustom builds customizable evaluation benchmarks in four stages: it organizes scientific knowledge into ontology-grounded knowledge units, identifies the units relevant to a custom requirement through voting-based multi-model consensus, retrieves relevant benchmark data via binary search, and then selects proxy subsets and generates data-grounded benchmarks for efficient evaluation. Experiments in chemistry and healthcare show that SciCustom reveals differences in model performance that standard benchmarks overlook, making scientific capability evaluation more relevant, practical, and scalable.

📝 Abstract

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, which limits their scalability and their alignment with real scientific use cases. In this paper, we propose SciCustom, a framework that addresses this problem by enabling the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.
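
The consensus and retrieval steps described in the abstract can be illustrated with a minimal Python sketch. This is not the SciCustom implementation: the function names, the vote threshold, the relevance score (fraction of an instance's tags that fall in the consensus set), and the use of a sorted score list with a binary-search cutoff are all assumptions made for illustration. The abstract does not specify how votes are aggregated or what the binary search operates on.

```python
from bisect import bisect_left
from collections import Counter


def consensus_units(votes_per_model, min_votes):
    """Keep knowledge units chosen by at least `min_votes` models.

    `votes_per_model` is a list with one entry per model, each entry
    being the knowledge units that model judged relevant (hypothetical
    input format).
    """
    counts = Counter(unit for votes in votes_per_model for unit in set(votes))
    return {unit for unit, n in counts.items() if n >= min_votes}


def relevance(instance_tags, relevant_units):
    """Assumed score: fraction of an instance's tags in the relevant set."""
    tags = set(instance_tags)
    if not tags:
        return 0.0
    return len(tags & relevant_units) / len(tags)


def retrieve(instances, relevant_units, threshold):
    """Return names of instances whose relevance is >= threshold.

    Instances are sorted by score once; the cutoff position is then
    located with a binary search instead of a linear scan.
    """
    scored = sorted(
        (relevance(tags, relevant_units), name)
        for name, tags in instances.items()
    )
    scores = [score for score, _ in scored]
    cut = bisect_left(scores, threshold)
    return [name for _, name in scored[cut:]]


if __name__ == "__main__":
    # Toy votes from three hypothetical models over chemistry units.
    votes = [
        ["acid_base", "kinetics"],
        ["kinetics", "thermo"],
        ["kinetics", "acid_base"],
    ]
    units = consensus_units(votes, min_votes=2)

    # Toy tagged data instances (names and tags are made up).
    instances = {
        "q1": ["kinetics"],
        "q2": ["thermo", "kinetics"],
        "q3": ["organic"],
    }
    print(sorted(units))
    print(retrieve(instances, units, threshold=0.5))
```

In this sketch a unit survives only if a majority of models select it, which is one simple reading of "voting-based multi-model consensus"; other aggregation rules (weighted votes, unanimous agreement) would fit the same interface.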

Problem

Research questions and friction points this paper is trying to address.

scientific evaluation

large language models

benchmarking

fine-grained capabilities

domain-specific assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

custom benchmarking

ontology-grounded knowledge

multi-model consensus