SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

📅 2026-05-19
🤖 AI Summary
Current evaluations of the scientific capabilities of large language models rely mainly on manually curated or general-purpose benchmarks, which scale poorly and miss the fine-grained, application-oriented competencies that real research requires. This work proposes SciCustom, a framework for fine-grained, application-specific assessment that requires neither expert annotation nor synthetic question generation. SciCustom builds custom benchmarks in four stages: it organizes scientific knowledge into ontology-grounded knowledge units, identifies the units relevant to a given requirement through voting-based multi-model consensus, retrieves relevant benchmark data via binary search, and then applies proxy subset selection and data-grounded benchmark generation. Experiments in chemistry and healthcare show that SciCustom reveals fine-grained performance differences between models that standard benchmarks overlook, offering a more scalable and application-aware basis for evaluating scientific capabilities.
📝 Abstract
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.
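To make the consensus and retrieval steps described above more concrete, here is a minimal Python sketch. It is not taken from the SciCustom repository, and the abstract does not specify these details. The function names, the knowledge-unit labels, the vote threshold, the relevance score (share of an instance's tags that fall in the selected units), and the idea of binary-searching a cutoff in a ranked list are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def consensus_units(votes: Sequence[Set[str]], min_votes: int) -> Set[str]:
    """Keep the knowledge units picked by at least `min_votes` models.

    Each element of `votes` is the set of units one model judged relevant
    to the custom requirement (hypothetical input format).
    """
    counts = Counter(unit for model_vote in votes for unit in model_vote)
    return {unit for unit, n in counts.items() if n >= min_votes}


def relevance(instance_tags: Set[str], relevant: Set[str]) -> float:
    """Assumed score: fraction of an instance's tags that are relevant."""
    if not instance_tags:
        return 0.0
    return len(instance_tags & relevant) / len(instance_tags)


def count_at_or_above(scores_desc: List[float], threshold: float) -> int:
    """Binary search over scores sorted in descending order.

    Returns how many leading scores are >= threshold.
    """
    lo, hi = 0, len(scores_desc)
    while lo < hi:
        mid = (lo + hi) // 2
        if scores_desc[mid] >= threshold:
            lo = mid + 1  # the cutoff lies to the right of mid
        else:
            hi = mid  # the cutoff lies at mid or to its left
    return lo


def retrieve(corpus: Dict[str, Set[str]], relevant: Set[str],
             threshold: float) -> List[str]:
    """Rank tagged instances by relevance and keep those above the threshold."""
    scores = {i: relevance(tags, relevant) for i, tags in corpus.items()}
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    k = count_at_or_above([scores[i] for i in ranked], threshold)
    return ranked[:k]


# Hypothetical votes from three models for a chemistry requirement.
votes = [
    {"organic_synthesis", "kinetics"},
    {"organic_synthesis", "thermodynamics"},
    {"organic_synthesis", "kinetics", "spectroscopy"},
]
relevant = consensus_units(votes, min_votes=2)

# Hypothetical tagged data instances (instance id -> knowledge-unit tags).
corpus = {
    "q1": {"organic_synthesis"},
    "q2": {"kinetics", "thermodynamics"},
    "q3": {"spectroscopy"},
    "q4": {"organic_synthesis", "kinetics"},
}
retrieved = retrieve(corpus, relevant, threshold=0.5)
```

In this toy run, two of the three models must agree before a unit counts as relevant, so only units with broad support drive retrieval. The binary search pays off once the ranked list is large, because the cutoff is found in logarithmic rather than linear time.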
Problem

Research questions and friction points this paper is trying to address.

scientific evaluation
large language models
benchmarking
fine-grained capabilities
domain-specific assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

custom benchmarking
ontology-grounded knowledge
multi-model consensus
relevance-aware retrieval
scientific LLM evaluation