🤖 AI Summary
This study addresses the challenge of assessing the trustworthiness of large language models (LLMs) in high-stakes scientific research. Methodologically, it introduces the first four-dimensional evaluation framework tailored to scientific applications, encompassing truthfulness, adversarial robustness, scientific safety, and scientific ethics. It develops an open-ended truthfulness benchmark through a verified reflection-tuning pipeline coupled with expert validation; designs a novel scientific-ethics benchmark covering eight subcategories; and integrates open-ended factual evaluation, semantic similarity metrics, LLM-based scoring, and adversarial testing into a multi-metric quantitative assessment system. A key contribution is the first systematic, multi-dimensional comparison of science-specialized LLMs against general-purpose LLMs across trustworthiness dimensions. Results indicate that general-purpose models consistently outperform science-specialized models overall, with GPT-o4-mini achieving the strongest performance in truthfulness and adversarial robustness. In contrast, science-specialized models exhibit significant deficiencies in logical rigor, ethical alignment, and safety, with concerning vulnerabilities in high-risk domains such as biosecurity and chemical weapons.
📝 Abstract
Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a new ethics benchmark for scientific research contexts covering eight subcategories, including dual-use research and bias. We evaluated seven prominent LLMs, comprising four science-specialized models and three general-purpose industry models, using multiple evaluation metrics, including accuracy, semantic similarity measures, and LLM-based scoring. General-purpose industry models outperformed science-specialized models across every trustworthiness dimension, with GPT-o4-mini demonstrating superior performance in truthfulness assessments and adversarial robustness. Science-specialized models showed significant deficiencies in logical and ethical reasoning capabilities, along with concerning vulnerabilities in safety evaluations, particularly in high-risk domains such as biosecurity and chemical weapons. By open-sourcing our framework, we provide a foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts.
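The multi-metric assessment described above combines accuracy, semantic similarity measures, and LLM-based scoring for open-ended answers. As a minimal illustrative sketch (not the paper's actual implementation), the snippet below scores a model's free-text answer against an expert reference using cosine similarity over bag-of-words count vectors; production frameworks would typically substitute embedding-based measures (e.g., sentence embeddings or BERTScore). The function names and the threshold value are hypothetical.

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_open_ended(model_answer: str, reference: str,
                     threshold: float = 0.5) -> dict:
    """Score one open-ended answer against an expert-validated reference.

    Returns the similarity and a pass/fail flag; a real pipeline would
    aggregate such scores with accuracy and LLM-based judgments.
    """
    sim = cosine_similarity(model_answer, reference)
    return {"similarity": round(sim, 3), "pass": sim >= threshold}

if __name__ == "__main__":
    ref = "water boils at 100 degrees celsius at sea level"
    print(score_open_ended("at sea level, water boils at 100 degrees celsius", ref))
```

A lexical stand-in like this captures surface overlap only; swapping `cosine_similarity` for an embedding-based similarity keeps the same scoring interface while rewarding paraphrases.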