Transform Before You Query: A Privacy-Preserving Approach for Vector Retrieval with Embedding Space Alignment

📅 2025-07-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vector database (VDB) services rely on opaque, server-side embedding models, so users must upload raw query text, which poses severe privacy risks in sensitive domains such as finance and healthcare. To address this, we propose STEER, a framework that enables end-to-end privacy-preserving retrieval without modifying the VDB server. STEER leverages cross-model semantic space alignment to generate high-fidelity approximate embeddings for queries directly on the client side, eliminating raw-text exposure and defending against embedding inversion attacks. We provide a theoretical analysis of the embedding approximation error and integrate retrieval-aware optimization to preserve accuracy. Experiments on million-scale benchmarks show that STEER achieves a 20% improvement in Recall@20 over baseline methods, with Recall@100 degrading by less than 5%.

📝 Abstract

Vector databases (VDBs) efficiently index and search high-dimensional vector embeddings of unstructured data, enabling the fast semantic similarity search that modern AI applications such as generative AI and recommendation systems depend on. Because current VDB service providers predominantly use proprietary black-box embedding models, users must expose raw query text to them via API in exchange for vector retrieval services. Consequently, if the query text involves confidential records from domains such as finance or healthcare, this mechanism inevitably leaks users' sensitive information. To address this issue, we introduce STEER (Secure Transformed Embedding vEctor Retrieval), a private vector retrieval framework that leverages the alignment between the semantic spaces of different embedding models to derive approximate embeddings for the query text. STEER performs retrieval with these approximate embeddings within the original VDB and requires no modifications on the server side. Our theoretical and experimental analyses demonstrate that STEER effectively safeguards query text privacy while maintaining retrieval accuracy. Although the approximate embeddings approximate those of the proprietary models, they still prevent providers from recovering the query text through Embedding Inversion Attacks (EIAs). Extensive experiments show that STEER's Recall@100 generally decreases by less than 5%. Furthermore, even when searching a text corpus of millions of entries, STEER achieves a Recall@20 that is 20% higher than current baselines.
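The Recall@k figures quoted in the abstract can be made concrete with a small sketch. It assumes one common definition: the fraction of the top-k results for the exact (proprietary-model) query embedding that are also returned for the approximate embedding. The abstract does not spell out its exact definition, and the helper names, the inner-product scoring, and the synthetic data below are all illustrative assumptions rather than anything taken from the paper.

```python
import numpy as np

def top_k_ids(query, corpus, k):
    """Indices of the k corpus vectors with the highest inner product."""
    scores = corpus @ query
    return np.argsort(-scores)[:k]

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate query also retrieves."""
    exact = set(int(i) for i in exact_ids)
    approx = set(int(i) for i in approx_ids)
    return len(exact & approx) / len(exact)

# Synthetic stand-ins (assumption): a unit-normalised corpus, an "exact"
# query embedding, and a noisy "approximate" version of the same query.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query_exact = rng.normal(size=64)
query_approx = query_exact + 0.1 * rng.normal(size=64)

exact_ids = top_k_ids(query_exact, corpus, k=20)
approx_ids = top_k_ids(query_approx, corpus, k=20)
recall = recall_at_k(exact_ids, approx_ids)
print(f"Recall@20 = {recall:.2f}")
```

Under this definition, a Recall@100 drop of less than 5% would mean that at least 95 of the 100 exact neighbours are still retrieved on average.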

Problem

Research questions and friction points this paper is trying to address.

Sensitive query text is exposed to proprietary black-box embedding models

Embedding spaces of different models must be aligned to enable secure approximate vector retrieval

Retrieval accuracy must be maintained while protecting against embedding inversion attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages embedding space alignment for privacy

Uses approximate embeddings to prevent data leakage

Maintains retrieval accuracy without server modifications
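The alignment idea in the points above can be illustrated with a minimal sketch. It assumes the simplest possible alignment, a linear map fitted by least squares on a set of non-sensitive anchor texts whose embeddings are available under both the local model and the provider's model. The summary does not say how STEER actually learns its transformation, so the linear map, the function names, and the synthetic data are hypothetical stand-ins, not the authors' method.

```python
import numpy as np

def fit_linear_alignment(local_emb, target_emb):
    """Least-squares map W such that local_emb @ W approximates target_emb."""
    W, *_ = np.linalg.lstsq(local_emb, target_emb, rcond=None)
    return W

def approximate_embedding(local_query_emb, W):
    """Project a local-model embedding into the target space and normalise."""
    approx = local_query_emb @ W
    return approx / np.linalg.norm(approx, axis=-1, keepdims=True)

# Synthetic stand-ins (assumption): the two spaces are related by an
# unknown linear map plus a little noise.
rng = np.random.default_rng(0)
n_anchor, d_local, d_target = 500, 32, 48
true_map = rng.normal(size=(d_local, d_target))
anchor_local = rng.normal(size=(n_anchor, d_local))
anchor_target = anchor_local @ true_map + 0.01 * rng.normal(size=(n_anchor, d_target))

# Fit the alignment once on the anchor set.
W = fit_linear_alignment(anchor_local, anchor_target)

# Client side: embed the private query locally, then transform it.
query_local = rng.normal(size=(1, d_local))
query_approx = approximate_embedding(query_local, W)

# For comparison only: the embedding the provider's model would have produced.
query_exact = query_local @ true_map
query_exact /= np.linalg.norm(query_exact, axis=-1, keepdims=True)
cosine = float((query_approx * query_exact).sum())
print(f"cosine(approx, exact) = {cosine:.4f}")
```

In this setup only `query_approx` would be sent to the VDB, so the raw query text never leaves the client. Real embedding spaces are unlikely to be related by an exactly linear map, which is presumably why the summary mentions an approximation-error analysis and retrieval-aware optimization.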

🔎 Similar Papers

PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration