🤖 AI Summary
This work addresses a limitation of current large language model evaluations, which predominantly focus on answer correctness while neglecting adherence to scientific reasoning norms. To bridge this gap, the paper introduces the concept of “scientific instruction-following ability” and presents SciIF, a multidisciplinary benchmark that systematically assesses a model’s capacity to explicitly comply with scientific conventions during problem-solving. SciIF employs three structured constraints—scientific conditions, semantic stability, and domain-specific reasoning processes—and implements a constraint-decoupled evaluation framework combining human and automated validation. This approach reframes scientific validity as an auditable instruction-following task, effectively distinguishing responses that are correct but violate scientific reasoning protocols from those that are both correct and compliant. The benchmark thus establishes a fine-grained, interpretable standard for evaluating the reliability of scientific agents.
📝 Abstract
As large language models (LLMs) transition from general knowledge retrieval to complex scientific discovery, their evaluation standards must also incorporate the rigorous norms of scientific inquiry. Existing benchmarks exhibit a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks assess only final-answer correctness, often rewarding models that arrive at the right result for the wrong reasons. To address this gap, we introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (e.g., boundary checks and assumptions), semantic stability (e.g., unit and symbol conventions), and specific processes (e.g., required numerical methods). Uniquely, SciIF emphasizes auditability, requiring models to provide explicit evidence of constraint satisfaction rather than implicit compliance. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures, ensuring that LLMs can function as reliable agents within the strict logical frameworks of science.
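The constraint-decoupled scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the `Constraint` class, the `evaluate` function, and the keyword-based checks are all hypothetical, standing in for SciIF's human and automated validation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of constraint-decoupled evaluation: correctness and
# per-constraint adherence are scored separately, so a response that is
# "right for the wrong reasons" can be told apart from a fully valid one.

@dataclass
class Constraint:
    cid: str
    pillar: str                   # "conditions" | "semantics" | "process"
    check: Callable[[str], bool]  # auditable check against the response text

def evaluate(response: str, answer_correct: bool,
             constraints: list[Constraint]) -> dict:
    """Return correctness plus per-constraint adherence for one response."""
    per_constraint = {c.cid: c.check(response) for c in constraints}
    return {
        "answer_correct": answer_correct,
        "per_constraint": per_constraint,
        "correct_and_compliant": answer_correct and all(per_constraint.values()),
    }

# Toy constraints: a unit convention (semantic stability) and a required
# numerical method (specific process), audited with simple keyword checks.
constraints = [
    Constraint("units-si", "semantics", lambda r: "m/s" in r),
    Constraint("method-newton", "process", lambda r: "Newton's method" in r),
]

report = evaluate(
    "Using Newton's method, the root gives v = 3.2 m/s.",
    answer_correct=True,
    constraints=constraints,
)
```

A real evaluator would replace the keyword lambdas with rubric-driven automated checks and human review, but the decoupled report structure is the point: each constraint's pass/fail is visible alongside final-answer correctness.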