AI Summary
To address the safety risk of hallucinated, unsupported responses from large language models (LLMs) in finance, this paper proposes ECLIPSE, a novel framework that formally characterizes hallucination as a mismatch between semantic entropy and evidence capacity, proving that its objective function is strictly convex with a unique stable optimum. Methodologically, ECLIPSE estimates semantic entropy via multi-sample clustering and quantifies the model's reliance on retrieved evidence through a token-level perplexity decomposition. A key empirical finding is that token-level log-probability uncertainty serves as a decisive signal for hallucination detection. Evaluated on a financial question-answering benchmark, ECLIPSE achieves 0.89 ROC AUC and 0.90 average precision, substantially outperforming the evaluated baselines. Ablation studies confirm that the performance gains stem directly from calibrated token-level probability modeling rather than from architectural or retrieval enhancements.
Abstract
Large language models (LLMs) produce fluent but unsupported answers (hallucinations), limiting safe deployment in high-stakes domains. We propose ECLIPSE, a framework that treats hallucination as a mismatch between a model's semantic entropy and the capacity of the available evidence. We combine entropy estimation via multi-sample clustering with a novel perplexity decomposition that measures how models use retrieved evidence. We prove that, under mild conditions, the resulting entropy-capacity objective is strictly convex with a unique stable optimum. We evaluate on a controlled financial question-answering dataset with GPT-3.5-turbo (n=200 balanced samples with synthetic hallucinations), where ECLIPSE achieves a ROC AUC of 0.89 and an average precision of 0.90, substantially outperforming a semantic entropy-only baseline (AUC 0.50). A controlled ablation with Claude-3-Haiku, which lacks token-level log probabilities, shows AUC dropping to 0.59 with coefficient magnitudes decreasing by 95%, demonstrating that ECLIPSE is a logprob-native mechanism whose effectiveness depends on calibrated token-level uncertainties. The perplexity decomposition features exhibit the largest learned coefficients, confirming that evidence utilization is central to hallucination detection. We position this work as a controlled mechanism study; broader validation across domains and on naturally occurring hallucinations remains future work.
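The abstract does not spell out the perplexity decomposition, but one plausible minimal sketch, assuming it contrasts answer-token log-probabilities with and without the retrieved evidence in context, looks like the following. The helper names `perplexity` and `evidence_reliance` are hypothetical, introduced here for illustration only.

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(logprobs) / len(logprobs))

def evidence_reliance(logprobs_with_evidence, logprobs_without):
    """Hypothetical decomposition term: log-ratio of answer perplexities.

    Positive values mean conditioning on the retrieved evidence made the
    answer tokens more probable, i.e., the model actually used the evidence;
    values near zero suggest the answer ignores the evidence.
    """
    ppl_with = perplexity(logprobs_with_evidence)
    ppl_without = perplexity(logprobs_without)
    return math.log(ppl_without / ppl_with)

# Example: the same answer scored with and without evidence in the prompt.
with_ev = [-0.1, -0.2]      # tokens are likely given the evidence
without_ev = [-1.0, -1.2]   # tokens are unlikely without it
```

In a detector such as ECLIPSE, features of this kind would be combined with the semantic-entropy estimate in a calibrated classifier; the abstract reports that these evidence-utilization features carry the largest learned coefficients.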