SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications

📅 2025-10-29
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study addresses the challenge of assessing the trustworthiness of large language models (LLMs) in high-stakes scientific research. Methodologically, it introduces a four-dimensional evaluation framework tailored to scientific applications, covering truthfulness, adversarial robustness, scientific safety, and scientific ethics. It proposes a verification-based reflection-tuning pipeline coupled with expert validation to build an open-ended truthfulness benchmark; designs a novel scientific-ethics benchmark spanning eight subcategories; and combines open-ended factual evaluation, semantic similarity metrics, LLM-based scoring, and adversarial testing into a multi-metric quantitative assessment (a minimal sketch follows below). A key contribution is the first systematic, multi-dimensional comparison of science-specialized and general-purpose LLMs across trustworthiness dimensions. Results indicate that general-purpose models consistently outperform science-specialized models overall, with GPT-o4-mini achieving top performance in truthfulness and adversarial robustness. In contrast, science-specialized models exhibit significant deficiencies in logical rigor, ethical alignment, and safety in high-risk domains such as biosecurity and chemical weapons.
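The summary's multi-metric assessment is not spelled out in this card, so the following is a minimal Python sketch of how exact-match accuracy, embedding similarity, and LLM-based scoring could be combined per item. The encoder model name, the `llm_judge` callable, and the judge prompt format are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a multi-metric answer scorer (not the paper's code).
# Assumes: pip install sentence-transformers
from typing import Callable

from sentence_transformers import SentenceTransformer, util

# Any sentence encoder works here; this model name is an illustrative choice.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")


def score_answer(
    question: str,
    reference: str,
    prediction: str,
    llm_judge: Callable[[str], float],  # hypothetical judge-model wrapper
) -> dict:
    """Combine exact match, embedding similarity, and an LLM judge score."""
    exact = float(prediction.strip().lower() == reference.strip().lower())

    # Cosine similarity between the reference and predicted answers.
    emb = _encoder.encode([reference, prediction], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()

    # Delegate open-ended grading to a judge model; this prompt format
    # is a placeholder, not the paper's rubric.
    judge_prompt = (
        f"Question: {question}\nReference: {reference}\n"
        f"Answer: {prediction}\nScore the answer from 0 to 1."
    )
    judged = llm_judge(judge_prompt)

    return {"exact_match": exact, "semantic_sim": semantic, "llm_score": judged}
```

How the three signals are weighted and aggregated per trustworthiness dimension is not given in this card, so aggregation is left to the caller.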

📝 Abstract
Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a novel ethics benchmark for scientific research contexts covering eight subcategories including dual-use research and bias. We evaluated seven prominent LLMs, including four science-specialized models and three general-purpose industry models, using multiple evaluation metrics including accuracy, semantic similarity measures, and LLM-based scoring. General-purpose industry models overall outperformed science-specialized models across each trustworthiness dimension, with GPT-o4-mini demonstrating superior performance in truthfulness assessments and adversarial robustness. Science-specialized models showed significant deficiencies in logical and ethical reasoning capabilities, along with concerning vulnerabilities in safety evaluations, particularly in high-risk domains such as biosecurity and chemical weapons. By open-sourcing our framework, we provide a foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM trustworthiness in scientific applications across multiple dimensions
Assessing truthfulness and ethics of language models in research contexts
Identifying safety vulnerabilities in specialized scientific AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel open-ended truthfulness benchmarks built via verified reflection-tuning (see the sketch after this list)
Ethics benchmark covering dual-use research and bias
Comprehensive evaluation across four trustworthiness dimensions
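The first bullet names a verified reflection-tuning pipeline for constructing open-ended truthfulness benchmarks. The paper's exact pipeline is not reproduced in this card; below is a hedged sketch of a generic generate-verify-reflect loop, where `generate`, `verify`, and `reflect` are stand-ins for model calls, and the paper's expert validation is reduced to the `verify` predicate.

```python
# Hypothetical generate-verify-reflect loop (not the paper's code).
from typing import Callable, Optional


def build_verified_answer(
    question: str,
    generate: Callable[[str], str],      # LLM call that drafts an answer
    verify: Callable[[str, str], bool],  # stand-in for automated/expert checks
    reflect: Callable[[str, str], str],  # LLM call that critiques a failed draft
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a draft that passes verification, or None after max_rounds."""
    prompt = question
    for _ in range(max_rounds):
        draft = generate(prompt)
        if verify(question, draft):
            return draft  # only verified drafts enter the benchmark
        # Fold the critique into the next prompt and try again.
        critique = reflect(question, draft)
        prompt = (
            f"{question}\n"
            f"Previous draft: {draft}\n"
            f"Critique: {critique}\n"
            "Revise the draft to address the critique."
        )
    return None
```

Capping the loop at `max_rounds` prevents oscillation on unverifiable items; how the paper handles items that never verify (drop, or escalate to expert review) is not stated here and remains an assumption.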