🤖 AI Summary
In high-stakes applications, the absence of unified, scalable evaluation tools hinders reliable prediction of large language model (LLM) output veracity. To address this, we introduce and open-source a comprehensive Python library, the first to systematically integrate over 30 veracity prediction methods across multiple dimensions: black-box vs. white-box access, self-supervised vs. supervised training, and reference-document-dependent vs. independent paradigms. The library supports both the Hugging Face and LiteLLM ecosystems, accommodates local and API-hosted LLMs, and provides end-to-end functionality for generation, evaluation, calibration, and long-form veracity prediction. Representative methods are evaluated on TriviaQA, GSM8K, and FactScore-Bio. Our work enhances reproducibility and usability, bridging a gap between methodological diversity and engineering practice in LLM veracity prediction.
📝 Abstract
Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs. white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets: TriviaQA, GSM8K, and FactScore-Bio. The code is available at https://github.com/Ybakman/TruthTorchLM.
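To make the "truth method" idea concrete, the following is a minimal, self-contained sketch of one classic black-box, self-supervised technique: self-consistency, which samples a question several times and uses the agreement rate of the most frequent answer as a confidence score. This is an illustration of the general technique only; it is not TruthTorchLM's actual API, and `sample_fn` / `toy_sampler` are hypothetical names introduced here for the example.

```python
import random
from collections import Counter

def self_consistency_score(sample_fn, question, n_samples=10):
    """Black-box truthfulness signal: sample n answers from the model
    and return (majority answer, fraction of samples that agree).

    A higher agreement fraction is treated as higher confidence that
    the majority answer is truthful. Requires no model internals,
    no grounding documents, and no labeled supervision.
    """
    answers = [sample_fn(question) for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n_samples

# Toy stand-in for an LLM sampler (a real setup would call a
# HuggingFace or LiteLLM model with nonzero temperature).
def toy_sampler(question):
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])

random.seed(0)
answer, confidence = self_consistency_score(toy_sampler, "Capital of France?")
print(answer, confidence)
```

Methods in the other quadrants of the taxonomy differ mainly in what they need: white-box methods read token probabilities or hidden states, supervised methods train a scorer on labeled outputs, and document-grounded methods check the answer against a reference text.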