AI Summary
To address the high computational cost and poor robustness of factuality assessment for long-text generation by large language models (LLMs), this paper proposes the Semantic Isotropy Index (SII), an unsupervised metric that quantifies the semantic uniformity of generated content via the angular dispersion of normalized text embeddings on the unit hypersphere. SII detects factual inconsistency without requiring labeled data, model fine-tuning, or hyperparameter optimization, and yields stable predictions from only a handful of sampled responses. Evaluated across diverse open-domain question-answering tasks, SII substantially outperforms existing fact-checking and consistency-evaluation methods in accuracy while incurring minimal computational overhead and enabling straightforward deployment. Its efficiency, scalability, and annotation-free design establish a practical new paradigm for trustworthiness assessment in real-world LLM applications.
Abstract
To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy -- the degree of uniformity across normalized text embeddings on the unit sphere -- and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy -- that is, greater embedding dispersion -- reliably signals lower factual consistency across samples. Our approach requires no labeled data, no fine-tuning, and no hyperparameter selection, and can be used with open- or closed-weight embedding models. Across multiple domains, our method consistently outperforms existing approaches in predicting nonfactuality in long-form responses using only a handful of samples -- offering a practical, low-cost approach for integrating trust assessment into real-world LLM workflows.
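The procedure described above can be sketched in a few lines: embed several sampled responses, project the embeddings onto the unit sphere, and score their angular dispersion. The dispersion measure below (one minus the resultant length of the mean direction, borrowed from directional statistics) is an illustrative assumption, not necessarily the paper's exact estimator; the embeddings would in practice come from any open- or closed-weight embedding model.

```python
import numpy as np

def semantic_isotropy(embeddings: np.ndarray) -> float:
    """Estimate the angular dispersion of response embeddings.

    embeddings: (n, d) array with one row per sampled long-form response.
    Returns a score in [0, 1]; higher dispersion signals lower factual
    consistency across the samples, per the paper's finding.
    """
    # Project each embedding onto the unit hypersphere.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Dispersion proxy: 1 - length of the mean direction (the "resultant
    # length"). Identical directions give 0; fully spread directions
    # approach 1. This choice of proxy is an assumption for illustration.
    resultant = np.linalg.norm(unit.mean(axis=0))
    return 1.0 - resultant

# Toy check: tightly clustered embeddings (consistent responses) should
# score lower than randomly spread ones (inconsistent responses).
rng = np.random.default_rng(0)
base = rng.normal(size=8)
clustered = base + 0.01 * rng.normal(size=(5, 8))
spread = rng.normal(size=(5, 8))
print(semantic_isotropy(clustered), semantic_isotropy(spread))
```

Because the score needs only a handful of samples and a single pass through an embedding model, it avoids the per-claim calls that make claim-by-claim fact-checking expensive.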