Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of quantifying joint uncertainty in tool-augmented large language models (LLMs) for high-stakes domains such as healthcare, this paper proposes the first framework for jointly modeling uncertainty in LLM-tool collaborative reasoning. Methodologically, it extends sequence-level uncertainty estimation to the tool-calling setting by integrating Bayesian inference, uncertainty propagation, confidence calibration of tool outputs, and synthetic QA data construction, and introduces an efficient approximation algorithm to keep the computation practical. Contributions include: (i) the first unified modeling of both the LLM's generative uncertainty and the predictive uncertainty of external tools (e.g., classifiers, retrieval systems); and (ii) empirical validation on two newly constructed tool-dependent QA benchmarks and a real-world RAG system, demonstrating significant improvements in distinguishing credible from non-credible answers, thereby providing reliable uncertainty estimates for high-risk decision-making.
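The joint uncertainty idea summarized above can be sketched in a few lines. The sketch below combines a standard LLM sequence-uncertainty proxy (length-normalized negative log-likelihood) with the Shannon entropy of a tool's predictive distribution; the convex-combination rule, the function names, and the `alpha` weight are illustrative assumptions, not the paper's actual estimator:

```python
import math

def sequence_uncertainty(token_probs):
    """Length-normalized negative log-likelihood of the generated tokens,
    a common proxy for LLM sequence-level uncertainty."""
    nll = -sum(math.log(p) for p in token_probs)
    return nll / len(token_probs)

def tool_entropy(class_probs):
    """Shannon entropy of a tool's predictive distribution,
    e.g. the softmax output of an external classifier."""
    return -sum(p * math.log(p) for p in class_probs if p > 0)

def joint_uncertainty(token_probs, class_probs, alpha=0.5):
    """Hypothetical combination of the two uncertainty sources;
    alpha is an illustrative weighting, not a value from the paper."""
    return (alpha * sequence_uncertainty(token_probs)
            + (1 - alpha) * tool_entropy(class_probs))
```

For instance, a fully confident system (all token probabilities 1.0, a one-hot tool distribution) yields zero joint uncertainty, while a uniform tool distribution raises it even when the LLM itself is confident.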

📝 Abstract
Modern Large Language Models (LLMs) often require external tools, such as machine learning classifiers or knowledge retrieval systems, to provide accurate answers in domains where their pre-trained knowledge is insufficient. This integration of LLMs with external tools expands their utility but also introduces a critical challenge: determining the trustworthiness of responses generated by the combined system. In high-stakes applications, such as medical decision-making, it is essential to assess the uncertainty of both the LLM's generated text and the tool's output to ensure the reliability of the final response. However, existing uncertainty quantification methods do not account for the tool-calling scenario, where both the LLM and external tool contribute to the overall system's uncertainty. In this work, we present a novel framework for modeling tool-calling LLMs that quantifies uncertainty by jointly considering the predictive uncertainty of the LLM and the external tool. We extend previous methods for uncertainty quantification over token sequences to this setting and propose efficient approximations that make uncertainty computation practical for real-world applications. We evaluate our framework on two new synthetic QA datasets, derived from well-known machine learning datasets, which require tool-calling for accurate answers. Additionally, we apply our method to retrieval-augmented generation (RAG) systems and conduct a proof-of-concept experiment demonstrating the effectiveness of our uncertainty metrics in scenarios where external information retrieval is needed. Our results show that the framework is effective in enhancing trust in LLM-based systems, especially in cases where the LLM's internal knowledge is insufficient and external tools are required.
Problem

Research questions and friction points this paper is trying to address.

Quantifying uncertainty in LLM-tool systems for reliable answers
Assessing combined uncertainty of LLM and external tool outputs
Enhancing trust in tool-augmented LLMs for high-stakes applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly models LLM and tool uncertainty
Extends token sequence uncertainty methods
Evaluates on synthetic QA datasets
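The "jointly models LLM and tool uncertainty" point can be illustrated with a minimal uncertainty-propagation sketch: marginalize the answer confidence over the tool's predictive distribution, p(answer | q) ≈ Σ_o p(o | q) · p(answer | q, o). The dictionary-based interface and names below are hypothetical, chosen only to make the marginalization concrete:

```python
def marginalized_answer_confidence(tool_dist, answer_conf_given_output):
    """Propagate tool uncertainty into the final answer confidence by
    marginalizing over the tool's possible outputs.

    tool_dist: dict mapping tool output -> probability (sums to 1).
    answer_conf_given_output: dict mapping the same outputs -> the
    LLM's answer confidence conditioned on that tool output.
    """
    return sum(p_o * answer_conf_given_output[o]
               for o, p_o in tool_dist.items())

# Illustrative numbers: an uncertain tool (0.8 / 0.2) dilutes the
# confidence of an answer that is strong only under one tool output.
confidence = marginalized_answer_confidence(
    {"benign": 0.8, "malignant": 0.2},
    {"benign": 0.9, "malignant": 0.3},
)
```

With these illustrative numbers the propagated confidence is 0.8 · 0.9 + 0.2 · 0.3 = 0.78, lower than the 0.9 the LLM would report if it trusted the tool's top prediction outright.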