AI Summary
Existing vector database (VDB) services rely on opaque, server-side embedding models, so users must upload raw query text, which poses severe privacy risks in sensitive domains such as finance and healthcare. To address this, we propose STEER, a novel framework that enables end-to-end privacy-preserving retrieval without modifying the VDB server. STEER leverages cross-model semantic space alignment to generate high-fidelity approximate embeddings for queries directly on the client side, eliminating raw-text exposure and defending against embedding inversion attacks. We provide theoretical analysis of embedding approximation error and integrate retrieval-aware optimization to preserve accuracy. Experiments on million-scale benchmarks demonstrate that STEER achieves a 20% improvement in Recall@20 over baseline methods, with Recall@100 degrading by less than 5%, significantly outperforming existing privacy-preserving retrieval approaches.
Abstract
Vector databases (VDBs) efficiently index and search high-dimensional vector embeddings of unstructured data, enabling the fast semantic similarity search that modern AI applications such as generative AI and recommendation systems depend on. Since current VDB service providers predominantly use proprietary black-box models, users are forced to expose raw query text to them via API in exchange for vector retrieval services. Consequently, if the query text involves confidential records from the finance or healthcare domains, this mechanism inevitably leaks users' sensitive information. To address this issue, we introduce STEER (Secure Transformed Embedding vEctor Retrieval), a private vector retrieval framework that leverages the alignment between the semantic spaces of different embedding models to derive approximate embeddings for the query text. STEER performs retrieval with the approximate embeddings within the original VDB and requires no modifications to the server side. Our theoretical and experimental analyses demonstrate that STEER effectively safeguards query text privacy while maintaining retrieval accuracy. Although the approximate embeddings only approximate those of the proprietary models, they still prevent providers from recovering the query text through Embedding Inversion Attacks (EIAs). Extensive experimental results show that STEER reduces Recall@100 by less than 5%. Furthermore, even when searching a text corpus of millions of entries, STEER achieves a Recall@20 that is 20% higher than current baselines.
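To make the Recall@k figures concrete, the following is a minimal sketch of one common way such a metric can be computed. It assumes that the ground truth is the top-k result set obtained with the exact (proprietary-model) query embedding, and that similarity is cosine similarity; the abstract does not state either detail, so both are assumptions. The function names (`cosine`, `top_k`, `recall_at_k`) and the toy vectors are hypothetical and are not taken from STEER.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k(query, corpus, k):
    """Indices of the k corpus vectors most similar to the query."""
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: cosine(query, corpus[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(exact_query, approx_query, corpus, k):
    """Fraction of the exact top-k results that the approximate query also retrieves.

    Assumption: the exact embedding's top-k set serves as the ground truth.
    """
    truth = set(top_k(exact_query, corpus, k))
    found = set(top_k(approx_query, corpus, k))
    return len(truth & found) / k


# Toy 2-D data for illustration only (not from the paper).
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
exact_query = [1.0, 0.05]   # stand-in for the proprietary model's embedding
approx_query = [0.8, 0.5]   # stand-in for a client-side approximate embedding

score = recall_at_k(exact_query, approx_query, corpus, k=2)
```

Under this definition, a reported drop of less than 5% in Recall@100 would mean that the approximate query still retrieves nearly all of the 100 entries that the exact query would have returned.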