🤖 AI Summary
Text-to-SQL systems rely on database schemas and are vulnerable to schema inference attacks, potentially leaking sensitive information. This paper introduces the first black-box, zero-prior-knowledge schema reverse-engineering framework. The method combines semantics-driven probe query generation, GPT-4-assisted output parsing, and a joint table/column/type identification and validation algorithm to achieve end-to-end schema reconstruction. It establishes the paradigm of "zero-knowledge schema inference," demonstrating that generative models (e.g., GPT-4, table-name F1 = 0.99) exhibit significantly higher structural leakage than fine-tuned models (F1 = 0.78). The paper further shows empirically that a simple output-filtering defense fails to prevent such attacks. This work exposes a critical, previously overlooked vulnerability in production-grade text-to-SQL systems and provides both a systematic attack methodology and foundational insights for future robustness research.
📝 Abstract
Text-to-SQL systems empower users to interact with databases using natural language, automatically translating queries into executable SQL code. However, their reliance on database schema information for SQL generation exposes them to significant security vulnerabilities, particularly schema inference attacks that can lead to unauthorized data access or manipulation. In this paper, we introduce a novel zero-knowledge framework for reconstructing the underlying database schema of text-to-SQL models without any prior knowledge of the database. Our approach systematically probes text-to-SQL models with specially crafted questions and leverages a surrogate GPT-4 model to interpret the outputs, effectively uncovering hidden schema elements, including tables, columns, and data types. We demonstrate that our method achieves high accuracy in reconstructing table names, with F1 scores of up to 0.99 for generative models and 0.78 for fine-tuned models, underscoring the severity of schema leakage risks. Furthermore, we propose a simple protection mechanism for generative models and empirically show its limitations in mitigating these attacks.
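The probe-and-parse loop described above can be sketched minimally as follows. Note the assumptions: `text_to_sql` is a hypothetical stub standing in for the black-box target system, the probe questions and canned responses are invented for illustration, and a regex replaces the paper's GPT-4-assisted parsing of the model's output.

```python
import re
from collections import Counter

def text_to_sql(question: str) -> str:
    # Hypothetical stand-in for the target text-to-SQL model.
    # In a real attack this is a black-box call to the deployed system.
    canned = {
        "How many users are there?": "SELECT COUNT(*) FROM users",
        "List all orders with their totals.": "SELECT id, total FROM orders",
        "Which users placed orders?":
            "SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id",
    }
    return canned.get(question, "SELECT 1")

# Semantically varied probe questions crafted to elicit schema references.
PROBES = [
    "How many users are there?",
    "List all orders with their totals.",
    "Which users placed orders?",
]

# Table names appear after FROM/JOIN keywords in the generated SQL.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.I)

def infer_tables(probes):
    """Aggregate table-name mentions across all generated SQL queries."""
    votes = Counter()
    for q in probes:
        sql = text_to_sql(q)
        votes.update(t.lower() for t in TABLE_RE.findall(sql))
    return votes

print(infer_tables(PROBES))
```

Repeated mentions across independent probes act as votes, so spurious identifiers from a single hallucinated query can be filtered by a frequency threshold before the validation step.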