Confidence Scoring for LLM-Generated SQL in Supply Chain Data Extraction

šŸ“… 2025-06-20
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Large language models (LLMs) generate SQL queries for supply-chain data extraction without quantifying uncertainty, hindering reliability assessment. Method: We propose a confidence estimation method based on cross-modal embedding similarity, integrating dual-modality embeddings (e.g., text-embedding-3-large), cross-lingual back-translation verification, and meta-prompted confidence generation. Contribution/Results: We systematically compare three strategies—translation consistency, embedding semantic similarity, and LLM-reported confidence—and find that SQL–natural-language cross-modal embedding similarity significantly outperforms overconfident self-reported scores in accuracy discrimination. Our method achieves an AUC of 0.89 for SQL correctness identification—32% higher than self-reported confidence—enabling effective human review prioritization and automated query rejection in high-risk scenarios.

šŸ“ Abstract
Large Language Models (LLMs) have recently enabled natural language interfaces that translate user queries into executable SQL, offering a powerful solution for non-technical stakeholders to access structured data. However, a key limitation is that LLMs do not natively express uncertainty, which makes it difficult to assess the reliability of their generated queries. This paper presents a case study that evaluates multiple approaches to estimating confidence scores for LLM-generated SQL in supply chain data retrieval. We investigate three strategies: (1) translation-based consistency checks; (2) embedding-based semantic similarity between user questions and generated SQL; and (3) self-reported confidence scores produced directly by the LLM. Our findings reveal that LLMs are often overconfident in their own outputs, which limits the effectiveness of self-reported confidence. In contrast, embedding-based similarity methods demonstrate strong discriminative power in identifying inaccurate SQL.
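The embedding-based strategy from the abstract can be sketched as follows: embed the user question and the generated SQL in a shared space, and use their cosine similarity as the confidence score. This is a minimal illustration, not the paper's implementation; the toy bag-of-words `embed` function stands in for a dual-modality embedding model such as text-embedding-3-large, and all names here are assumptions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in. In the paper's setting this would be a
    # cross-modal embedding model (e.g. text-embedding-3-large) applied to
    # both natural language and SQL.
    return Counter(text.lower().replace(",", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    # Standard cosine similarity over sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sql_confidence(question: str, sql: str) -> float:
    # Cross-modal similarity between question and generated SQL,
    # used directly as a confidence score in [0, 1].
    return cosine(embed(question), embed(sql))


question = "total shipment weight per supplier last quarter"
good_sql = "SELECT supplier, SUM(weight) FROM shipment GROUP BY supplier"
bad_sql = "SELECT * FROM invoice"

# A relevant query should score higher than an off-topic one.
print(sql_confidence(question, good_sql) > sql_confidence(question, bad_sql))
```

With a real embedding model, the same scoring function would capture semantic rather than lexical overlap, which is what gives the method its discriminative power over self-reported confidence.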
Problem

Research questions and friction points this paper is trying to address.

Assessing reliability of LLM-generated SQL queries
Evaluating confidence scores for supply chain data extraction
Addressing LLM overconfidence in self-reported SQL accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Translation-based consistency checks for SQL
Embedding-based semantic similarity assessment
LLM self-reported confidence scoring
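Once a confidence score is available, the summary's downstream use (human review prioritization and automated query rejection in high-risk scenarios) amounts to simple thresholding. A hypothetical sketch; the threshold values and function names are illustrative assumptions, not figures from the paper:

```python
def triage(confidence: float, high_risk: bool,
           review_threshold: float = 0.7,
           reject_threshold: float = 0.4) -> str:
    """Route an LLM-generated SQL query based on its confidence score.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    if high_risk and confidence < reject_threshold:
        # Automated rejection: never execute low-confidence SQL
        # in high-risk scenarios.
        return "reject"
    if confidence < review_threshold:
        # Mid-confidence queries are queued for human review.
        return "human_review"
    # High-confidence queries can be executed directly.
    return "execute"


print(triage(0.92, high_risk=True))   # high confidence passes through
print(triage(0.55, high_risk=False))  # mid confidence goes to review
print(triage(0.30, high_risk=True))   # low confidence, high risk: reject
```

In practice the thresholds would be calibrated on a labeled validation set, e.g. by choosing operating points on the ROC curve behind the reported AUC.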