It's LIT! Reliability-Optimized LLMs with Inspectable Tools

📅 2025-11-18
🤖 AI Summary
Large language models (LLMs) suffer from opaque reasoning and unreliable multi-step tool invocation, hindering their trustworthy deployment in high-stakes applications. To address this, the authors propose LIT (LLMs with Inspectable Tools), a framework built on the tool-calling capabilities of existing LLMs that steers them toward the most reliable and easiest-to-troubleshoot solution path, which may span multiple sequential tool calls, guided by customizable reliability cost functions. They construct a benchmark of 1,300 questions over a collection of specialized tools (e.g., calculators, linear prediction models, and random-forest builders), grounded in real-world data from the Harvard USPTO Patent Dataset and a new dataset of NeurIPS 2023 papers. Experiments demonstrate that LIT enables more reliable and informed problem-solving across mathematical, coding, and modeling tasks without degrading task performance. The core contribution is formalizing reliability and troubleshootability as explicit, auditable criteria for selecting tool-calling paths.

📝 Abstract
Large language models (LLMs) have exhibited remarkable capabilities across various domains. The ability to call external tools further expands their capability to handle real-world tasks. However, LLMs often follow an opaque reasoning process, which limits their usefulness in high-stakes domains where solutions need to be trustworthy to end users. LLMs can choose solutions that are unreliable and difficult to troubleshoot, even if better options are available. We address this issue by forcing LLMs to use external -- more reliable -- tools to solve problems when possible. We present a framework built on the tool-calling capabilities of existing LLMs to enable them to select the most reliable and easy-to-troubleshoot solution path, which may involve multiple sequential tool calls. We refer to this framework as LIT (LLMs with Inspectable Tools). In order to support LIT, we introduce a new and challenging benchmark dataset of 1,300 questions and a customizable set of reliability cost functions associated with a collection of specialized tools. These cost functions summarize how reliable each tool is and how easy it is to troubleshoot. For instance, a calculator is reliable across domains, whereas a linear prediction model is not reliable if there is distribution shift, but it is easy to troubleshoot. A tool that constructs a random forest is neither reliable nor easy to troubleshoot. These tools interact with the Harvard USPTO Patent Dataset and a new dataset of NeurIPS 2023 papers to solve mathematical, coding, and modeling problems of varying difficulty levels. We demonstrate that LLMs can achieve more reliable and informed problem-solving while maintaining task performance using our framework.
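The abstract's examples (a calculator is reliable and inspectable everywhere; a linear model is fragile under distribution shift but easy to debug; a random forest is neither) suggest a simple cost-based path selection. The sketch below is illustrative only, under assumed cost values and function names not taken from the paper; the paper's actual cost functions and selection mechanism may differ.

```python
# Hypothetical sketch of LIT-style tool-path selection.
# Each tool carries (reliability_cost, troubleshoot_cost); both values here
# are invented for illustration. A path's total cost sums over its tool
# calls, and the framework prefers the lowest-cost path that solves the task.

TOOL_COSTS = {
    "calculator":    (0.0, 0.0),  # reliable across domains, trivial to inspect
    "linear_model":  (0.5, 0.1),  # unreliable under distribution shift, easy to debug
    "random_forest": (0.7, 0.8),  # neither reliable nor easy to troubleshoot
}

def path_cost(path, w_rel=1.0, w_dbg=1.0):
    """Weighted reliability + troubleshootability cost of a tool-call sequence."""
    return sum(w_rel * TOOL_COSTS[t][0] + w_dbg * TOOL_COSTS[t][1] for t in path)

def select_path(candidate_paths, **weights):
    """Pick the candidate multi-step tool path with the lowest total cost."""
    return min(candidate_paths, key=lambda p: path_cost(p, **weights))

best = select_path([
    ["random_forest"],
    ["calculator", "linear_model"],
    ["calculator", "calculator"],
])
print(best)  # -> ['calculator', 'calculator']
```

The weights `w_rel` and `w_dbg` stand in for the paper's "customizable" cost functions: raising `w_dbg` would, for instance, penalize the random forest more heavily than the linear model.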
Problem

Research questions and friction points this paper is trying to address.

LLMs have opaque reasoning processes limiting trust in high-stakes domains
LLMs often choose unreliable solutions that are difficult to troubleshoot
Need framework to force LLMs to use reliable, inspectable external tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Forcing LLMs to use reliable external tools
Selecting most reliable and troubleshootable solution paths
Introducing reliability cost functions for tool evaluation