Transform Before You Query: A Privacy-Preserving Approach for Vector Retrieval with Embedding Space Alignment

📅 2025-07-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vector database (VDB) services rely on opaque, server-side embedding models, requiring users to upload raw query text and posing severe privacy risks in sensitive domains such as finance and healthcare. To address this, we propose STEER, a novel framework that enables end-to-end privacy-preserving retrieval without modifying the VDB server. STEER leverages cross-model semantic space alignment to generate high-fidelity approximate embeddings for queries directly on the client side, eliminating raw-text exposure and defending against embedding inversion attacks. We provide theoretical analysis of embedding approximation error and integrate retrieval-aware optimization to preserve accuracy. Experiments on million-scale benchmarks demonstrate that STEER achieves a 20% improvement in Recall@20 over baseline methods, with Recall@100 degrading by less than 5%, significantly outperforming existing privacy-preserving retrieval approaches.

📝 Abstract
Vector databases (VDBs) efficiently index and search high-dimensional vector embeddings of unstructured data, enabling the fast semantic similarity search that modern AI applications such as generative AI and recommendation systems depend on. Because current VDB service providers predominantly use proprietary black-box models, users are forced to expose raw query text to them via API in exchange for vector retrieval services. Consequently, if the query text involves confidential records from the finance or healthcare domains, this mechanism inevitably leaks users' sensitive information. To address this issue, we introduce STEER (Secure Transformed Embedding vEctor Retrieval), a private vector retrieval framework that leverages the alignment between the semantic spaces of different embedding models to derive approximate embeddings for the query text. STEER performs retrieval with the approximate embeddings in the original VDB and requires no modifications on the server side. Our theoretical and experimental analyses demonstrate that STEER effectively safeguards query text privacy while maintaining retrieval accuracy. Although the approximate embeddings only approximate those produced by the proprietary models, they still prevent providers from recovering the query text through Embedding Inversion Attacks (EIAs). Extensive experiments show that STEER's Recall@100 generally drops by less than 5%. Furthermore, even when searching a text corpus of millions of entries, STEER achieves a Recall@20 that is 20% higher than current baselines.
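The Recall@k figures quoted in the abstract compare what the approximate query retrieves against what the exact query would have retrieved. The abstract does not spell out the exact definition used in the paper, so the following is a minimal sketch of one common reading: the fraction of the exact query's top-k results that the approximate query also returns. The function names, the inner-product scoring, and the synthetic corpus are all assumptions made for illustration.

```python
import numpy as np

def top_k(query, corpus, k):
    """Return indices of the k corpus rows with the highest inner product."""
    scores = corpus @ query
    return np.argsort(-scores)[:k]

def recall_at_k(exact_query, approx_query, corpus, k):
    """Share of the exact query's top-k that the approximate query also finds."""
    truth = set(top_k(exact_query, corpus, k).tolist())
    found = set(top_k(approx_query, corpus, k).tolist())
    return len(truth & found) / k

# Synthetic stand-ins (assumption): random vectors instead of real embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 16))
exact = rng.normal(size=16)
approx = exact + 0.05 * rng.normal(size=16)  # a slightly perturbed query

r = recall_at_k(exact, approx, corpus, 20)
print(f"Recall@20 of the perturbed query: {r:.2f}")
```

Under this reading, a Recall@100 drop of less than 5% would mean the approximate query still recovers more than 95 of the 100 items the exact query retrieves.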
Problem

Research questions and friction points this paper is trying to address.

Prevents sensitive query text exposure to proprietary black-box models
Aligns embedding spaces for secure approximate vector retrieval
Maintains retrieval accuracy while protecting against embedding inversion attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages embedding space alignment for privacy
Uses approximate embeddings to prevent data leakage
Maintains retrieval accuracy without server modifications
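To make the idea of embedding space alignment concrete, here is a minimal sketch under strong assumptions. This summary does not describe STEER's actual alignment procedure, so the sketch swaps in a plain least-squares linear map fitted on paired embeddings of non-sensitive text. The names `fit_linear_alignment` and `transform_query` are hypothetical, and the synthetic matrices stand in for the outputs of a local open model and the provider's proprietary model.

```python
import numpy as np

def fit_linear_alignment(local_emb, target_emb):
    """Least-squares W such that local_emb @ W approximates target_emb."""
    W, *_ = np.linalg.lstsq(local_emb, target_emb, rcond=None)
    return W

def transform_query(local_query_emb, W):
    """Map a locally computed query embedding into the target space, unit-normalized."""
    approx = local_query_emb @ W
    return approx / np.linalg.norm(approx, axis=-1, keepdims=True)

# Synthetic paired embeddings (assumption): a hidden linear relation plus noise
# plays the role of the relationship between the two models' semantic spaces.
rng = np.random.default_rng(0)
d_local, d_target, n_pairs = 32, 48, 500
hidden_map = rng.normal(size=(d_local, d_target))
local_public = rng.normal(size=(n_pairs, d_local))
target_public = local_public @ hidden_map + 0.01 * rng.normal(size=(n_pairs, d_target))

W = fit_linear_alignment(local_public, target_public)

# The sensitive query is embedded locally only; just the transformed vector
# would be sent to the VDB.
local_query = rng.normal(size=(1, d_local))
approx_query = transform_query(local_query, W)

# Compare with the embedding the hidden map would have produced.
exact_query = local_query @ hidden_map
exact_query /= np.linalg.norm(exact_query, axis=-1, keepdims=True)
cosine = float((approx_query @ exact_query.T)[0, 0])
print(f"cosine(approx, exact) = {cosine:.4f}")
```

Real embedding models are unlikely to be related by an exactly linear map, so this toy setting overstates how well a linear fit would work in practice. The point is only the workflow: fit the map on public pairs, then transform private queries on the client.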
Authors
Ruiqi He, Nankai University
Zekun Fei, Nankai University (Data Security, AI Security)
Jiaqi Li, Nankai University
Xinyuan Zhu, Nankai University
Biao Yi, Nankai University (LLM Security, Trustworthy LLM, Steganography)
Siyi Lv, Nankai University
Weijie Liu, Nankai University (System Security, Virtualization, Binary Analysis, Image Fusion)
Zheli Liu, Nankai University