🤖 AI Summary
This work addresses the critical challenge of ensuring factual reliability in large language models (LLMs) within high-stakes applications, where existing conformal inference methods often prove either overly conservative or ill-equipped to handle complex grouping structures, leading to excessive rejection of valid statements. To overcome these limitations, the authors propose a novel multi-LLM adaptive conformal inference framework that models factuality as a product of statement-level scores and leverages ensemble scoring across multiple LLMs to enhance accuracy. The approach further incorporates grouped conditional calibration and an adaptive filtering mechanism, which jointly maximize the retention of true statements while strictly adhering to user-specified coverage guarantees and reducing computational overhead. Experimental results demonstrate that the method achieves superior performance over current baselines without compromising statistical validity.
📝 Abstract
Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI-Yonsei/MACI.
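To make the filtering setup concrete, the sketch below shows generic split-conformal claim filtering, not MACI itself: `conformal_threshold`, `filter_claims`, and the calibration-score construction are illustrative assumptions. It assumes each calibration response contributes the highest factuality score assigned to any of its *false* claims; thresholding test-time claims above the calibrated quantile then bounds the chance that a false claim survives filtering by roughly the user-specified level alpha.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    calibration score (clipped to the largest score when n is small).

    Assumed setup: each calibration score is the max factuality score
    given to a false claim in one calibration response, so keeping only
    claims scoring strictly above this threshold retains a false claim
    with probability at most ~alpha.
    """
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(np.asarray(cal_scores))[min(k, n) - 1])

def filter_claims(claims, scores, tau):
    """Keep only claims whose factuality score exceeds the threshold."""
    return [c for c, s in zip(claims, scores) if s > tau]

# Hypothetical calibration scores (one per calibration response).
cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
tau = conformal_threshold(cal, alpha=0.2)  # -> 0.8 for this data
kept = filter_claims(["claim A", "claim B"], [0.95, 0.50], tau)
# kept == ["claim A"]: only the high-scoring claim survives
```

MACI's contribution, per the abstract, is to replace the single score above with an ensemble of multi-LLM scores combined multiplicatively, and to calibrate the threshold per group rather than globally; this sketch covers only the shared conformal skeleton.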