Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs

📅 2024-12-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) often make overconfident, incorrect predictions, which is risky in high-stakes domains such as healthcare and finance. To address this, we propose a framework that combines conformal prediction (CP) with an insight from the Monty Hall problem. Our contributions are: (1) CP-OPT, an optimization framework that learns CP score functions which minimize prediction set size while preserving CP's coverage guarantee; and (2) CROQ (conformal revision of questions), which rewrites a multiple-choice question so that only the options in the conformal prediction set remain, turning uncertainty quantification into accuracy gains. On MMLU, ToolAlpaca, and TruthfulQA, CP-OPT reduces average prediction set size by 37% while maintaining coverage. CROQ improves accuracy by 5.2–8.9 percentage points over standard inference across Gemma-2, Llama-3, and Phi-3, with further gains when paired with CP-OPT scores.
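
The trade-off CP-OPT optimizes (guaranteed coverage versus small prediction sets) can be seen in a minimal split-conformal sketch. This sketch makes several assumptions. It assumes access to the LLM's logits over the answer options. It uses the common baseline score 1 − softmax probability, not the learned CP-OPT score. The function names and toy logits are hypothetical. The threshold is the ⌈(n+1)(1−α)⌉-th smallest calibration score. When calibration and test questions are exchangeable, this gives sets that contain the true option with probability at least 1−α.

```python
import math


def softmax(logits):
    """Convert per-option logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nonconformity(logits, option):
    """Baseline score: 1 minus the probability the model gives to `option`.

    Assumption: this heuristic stands in for a score; CP-OPT would learn one instead.
    """
    return 1.0 - softmax(logits)[option]


def calibrate_threshold(cal_logits, cal_labels, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = sorted(nonconformity(z, y) for z, y in zip(cal_logits, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return math.inf
    return scores[k - 1]


def prediction_set(logits, qhat):
    """Indices of all options whose score does not exceed the calibrated threshold."""
    return [j for j in range(len(logits)) if nonconformity(logits, j) <= qhat]


# Hypothetical usage with made-up 4-option logits (not data from the paper).
cal_logits = [[3.0, 0.5, 0.1, 0.0], [0.2, 2.5, 0.3, 0.1], [1.0, 1.2, 0.0, 0.4],
              [0.0, 0.1, 2.0, 1.5], [2.2, 2.0, 0.0, 0.0], [0.3, 0.0, 0.1, 1.8],
              [1.5, 0.2, 1.4, 0.0], [0.0, 3.0, 0.0, 0.5], [0.9, 0.8, 0.7, 0.6]]
cal_labels = [0, 1, 1, 2, 1, 3, 2, 1, 3]
qhat = calibrate_threshold(cal_logits, cal_labels, alpha=0.25)
print(prediction_set([1.1, 1.1, 0.2, -1.0], qhat))  # -> [0, 1]
```

The coverage guarantee comes entirely from the calibration step, so any score function yields valid sets. Swapping `nonconformity` for a better score, which is what CP-OPT learns, changes only how large the sets are.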

📝 Abstract

Large language models (LLMs) are empowering decision-making in several applications, including tool or API usage and answering multiple-choice questions (MCQs). However, they often make overconfident, incorrect predictions, which can be risky in high-stakes settings like healthcare and finance. To mitigate these risks, recent works have used conformal prediction (CP), a model-agnostic framework for distribution-free uncertainty quantification. CP transforms a score function into prediction sets that contain the true answer with high probability. While CP provides this coverage guarantee for arbitrary scores, the score quality significantly impacts prediction set sizes. Prior works have relied on LLM logits or other heuristic scores, lacking quality guarantees. We address this limitation by introducing CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Furthermore, inspired by the Monty Hall problem, we extend CP's utility beyond uncertainty quantification to improve accuracy. We propose conformal revision of questions (CROQ) to revise the problem by narrowing down the available choices to those in the prediction set. The coverage guarantee of CP ensures that the correct choice is in the revised question prompt with high probability, while the smaller number of choices increases the LLM's chances of answering it correctly. Experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with Gemma-2, Llama-3 and Phi-3 models show that CP-OPT significantly reduces set sizes while maintaining coverage, and CROQ improves accuracy over the standard inference, especially when paired with CP-OPT scores. Together, CP-OPT and CROQ offer a robust framework for improving both the safety and accuracy of LLM-driven decision-making.
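
The question-revision step of CROQ can be sketched as a small prompt-rewriting function. This is a minimal sketch under stated assumptions. The function name `revise_question`, the `A.`/`B.` option format, and the final instruction line are hypothetical, not the paper's prompt template. The same goes for relabeling the surviving options consecutively. The prediction set is assumed to come from a conformal procedure such as split CP with a calibrated threshold.

```python
import string


def revise_question(question, options, pred_set):
    """Rewrite an MCQ so that only the options in the conformal prediction set remain.

    question: the original question text.
    options:  answer strings in their original order.
    pred_set: indices of the options kept by conformal prediction.
    Returns the revised prompt and a map from new letters to original indices.
    Supports up to 26 surviving options (one per uppercase letter).
    """
    kept = sorted(set(pred_set))  # preserve the original option order
    if not kept:
        # An empty set leaves nothing to ask; the caller must decide how to fall back.
        raise ValueError("prediction set is empty")
    lines = [question]
    new_to_orig = {}
    for letter, orig_idx in zip(string.ascii_uppercase, kept):
        new_to_orig[letter] = orig_idx
        lines.append(f"{letter}. {options[orig_idx]}")
    # Hypothetical instruction line; the paper's actual prompt template may differ.
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines), new_to_orig


options = ["Venus", "Mars", "Jupiter", "Mercury"]
prompt, new_to_orig = revise_question(
    "Which planet is known as the Red Planet?", options, pred_set=[2, 1]
)
print(prompt)
# If the LLM then answers "A", map it back to the original option list:
print(options[new_to_orig["A"]])  # -> Mars
```

Relabeling makes the revised question look like an ordinary MCQ with fewer choices. The returned map lets the model's answer be scored against the original options. If CP's coverage guarantee holds, the correct option survives the revision with probability at least 1 − α.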

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Accuracy Improvement

Reliability Enhancement

Innovation

Methods, ideas, or system contributions that make the work stand out.

CP-OPT

CROQ

Decision Accuracy Enhancement

🔎 Similar Papers

Efficient Sequential Decision Making with Large Language Models