🤖 AI Summary
This work addresses the limited generalizability of existing handcrafted uncertainty quantification methods for large language models (LLMs), which typically rely on domain-specific knowledge and heuristic rules. The paper introduces an LLM-driven evolutionary search framework that automatically discovers unsupervised uncertainty quantification procedures expressed as Python programs, tailored to atomic fact-checking tasks. By combining program synthesis with evolutionary optimization, the approach achieves up to a 6.7% relative ROC-AUC improvement over the best human-designed methods across nine benchmark datasets and generalizes robustly out-of-distribution. The study also reveals that different LLMs adopt qualitatively distinct strategies during evolution, establishing a promising paradigm for automated, interpretable uncertainty quantification.
📝 Abstract
Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ qualitatively distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance; Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results indicate that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.
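To make the idea concrete, here is a minimal sketch of what an evolved unsupervised UQ program of the "positional weighting" flavor described above might look like. This is a hypothetical illustration, not a method from the paper: the function name, the weighting formula, and the assumption that the claim's token log-probabilities are available as a list are all ours. Later tokens are weighted more heavily, and the weighted negative log-likelihood is squashed into a [0, 1] uncertainty score usable for ROC-AUC evaluation.

```python
import math

def positional_uncertainty(token_logprobs, alpha=0.5):
    """Hypothetical positional-weighting UQ score.

    token_logprobs: per-token log-probabilities of the generated claim.
    alpha: controls how much extra weight later positions receive.
    Returns an uncertainty score in [0, 1]; higher means less confident.
    """
    if not token_logprobs:
        return 1.0  # no evidence at all: maximally uncertain
    n = len(token_logprobs)
    # Linearly increasing positional weights: 1.0 at the first token,
    # 1 + alpha at the last token.
    weights = [1.0 + alpha * i / max(n - 1, 1) for i in range(n)]
    # Weighted mean negative log-likelihood of the claim's tokens.
    weighted_nll = sum(-lp * w for lp, w in zip(token_logprobs, weights)) / sum(weights)
    # Squash into [0, 1) so scores are comparable across claim lengths.
    return 1.0 - math.exp(-weighted_nll)
```

An evolutionary search would mutate pieces like the weighting schedule or the squashing function and keep variants that raise validation ROC-AUC; because the result is a short Python program, the discovered detector stays inspectable.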