AI Summary
Large language models (LLMs) generate SQL queries for supply-chain data extraction without quantifying uncertainty, which hinders reliability assessment. Method: We propose a confidence estimation method based on cross-modal embedding similarity, integrating dual-modality embeddings (e.g., text-embedding-3-large), cross-lingual back-translation verification, and meta-prompted confidence generation. Contribution/Results: We systematically compare three strategies (translation consistency, embedding-based semantic similarity, and LLM self-reported confidence) and find that cross-modal embedding similarity between SQL and natural language significantly outperforms overconfident self-reported scores at discriminating correct from incorrect queries. Our method achieves an AUC of 0.89 for SQL correctness identification, 32% higher than self-reported confidence, enabling effective human-review prioritization and automated query rejection in high-risk scenarios.
Abstract
Large Language Models (LLMs) have recently enabled natural language interfaces that translate user queries into executable SQL, offering a powerful way for non-technical stakeholders to access structured data. However, LLMs do not natively express uncertainty, which makes it difficult to assess the reliability of their generated queries. This paper presents a case study evaluating multiple approaches to estimating confidence scores for LLM-generated SQL in supply chain data retrieval. We investigate three strategies: (1) translation-based consistency checks; (2) embedding-based semantic similarity between user questions and generated SQL; and (3) self-reported confidence scores produced directly by the LLM. Our findings reveal that LLMs are often overconfident in their own outputs, which limits the effectiveness of self-reported confidence. In contrast, embedding-based similarity methods demonstrate strong discriminative power in identifying inaccurate SQL.
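To make the embedding-based strategy concrete, the following is a minimal sketch of cross-modal confidence scoring. It assumes the user question and the generated SQL have already been embedded into a shared vector space (e.g., by a model such as text-embedding-3-large); the function name `embedding_confidence` and the threshold of 0.75 are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_confidence(question_vec, sql_vec, threshold=0.75):
    """Score a generated SQL query by its cross-modal similarity to the
    user question. Queries scoring below the (illustrative) threshold
    would be routed to human review or rejected outright."""
    score = cosine_similarity(question_vec, sql_vec)
    return score, score >= threshold
```

In practice, the SQL could be embedded either directly or after back-translation into natural language, and the acceptance threshold would be tuned on a labeled validation set of correct and incorrect queries.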