🤖 AI Summary
This work addresses the limited generalizability of existing handcrafted uncertainty quantification methods for large language models (LLMs), which typically rely on domain-specific knowledge and heuristic rules. The paper introduces an LLM-driven evolutionary search framework that automatically discovers unsupervised uncertainty quantification procedures expressed as Python programs, tailored to atomic fact-checking tasks. By combining program synthesis with evolutionary optimization, the approach achieves up to a 6.7% relative ROC-AUC improvement over the best human-designed methods across nine benchmark datasets and generalizes robustly out-of-distribution. The study also reveals that different LLMs adopt qualitatively distinct strategies during evolution, establishing a promising paradigm for automated, interpretable uncertainty quantification.
📝 Abstract
Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ qualitatively distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance; Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results indicate that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.
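To make the idea concrete, here is a minimal sketch of what an evolved unsupervised UQ program of the "positional weighting" flavor described above might look like. This is a hypothetical illustration, not a method from the paper: the function name, the weighting formula, and the assumption that the claim's token log-probabilities are available as a list are all ours. Later tokens are weighted more heavily, and the weighted negative log-likelihood is squashed into a [0, 1] uncertainty score usable for ROC-AUC evaluation.

```python
import math

def positional_uncertainty(token_logprobs, alpha=0.5):
    """Hypothetical positional-weighting UQ score.

    token_logprobs: per-token log-probabilities of the generated claim.
    alpha: controls how much extra weight later positions receive.
    Returns an uncertainty score in [0, 1]; higher means less confident.
    """
    if not token_logprobs:
        return 1.0  # no evidence at all: maximally uncertain
    n = len(token_logprobs)
    # Linearly increasing positional weights: 1.0 at the first token,
    # 1 + alpha at the last token.
    weights = [1.0 + alpha * i / max(n - 1, 1) for i in range(n)]
    # Weighted mean negative log-likelihood of the claim's tokens.
    weighted_nll = sum(-lp * w for lp, w in zip(token_logprobs, weights)) / sum(weights)
    # Squash into [0, 1) so scores are comparable across claim lengths.
    return 1.0 - math.exp(-weighted_nll)
```

An evolutionary search would mutate pieces like the weighting schedule or the squashing function and keep variants that raise validation ROC-AUC; because the result is a short Python program, the discovered detector stays inspectable.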