Towards Robustness: A Critique of Current Vector Database Assessments

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vector database evaluation over-relies on mean recall, which obscures inter-query performance variance and hides poor performance on hard queries, thereby compromising reliability in downstream tasks such as RAG. This work advocates shifting from an "average performance" to a "distributional robustness" evaluation paradigm. We propose Robustness-δ@K, a novel metric quantifying performance consistency in the tail of the recall distribution (e.g., the worst 10% of queries). We integrate this thresholded query-coverage metric into mainstream benchmarks (BEIR, MTEB) to re-evaluate prominent indexes, including HNSW, IVF, and DiskANN. Experiments show that, at identical mean recall, indexes with higher Robustness-δ@K yield up to 12.3% improvement in RAG answer accuracy. Moreover, we identify graph connectivity and cluster balance as critical architectural factors governing retrieval robustness.

📝 Abstract
Vector databases are critical infrastructure in AI systems, and average recall is the dominant metric for their evaluation. Both users and researchers rely on it to choose and optimize their systems. We show that relying on average recall is problematic: it hides variability across queries, allowing systems with strong mean performance to underperform significantly on hard queries. These tail cases confuse users and can lead to failure in downstream applications such as RAG. We argue that robustness, i.e., consistently achieving acceptable recall across queries, is crucial to vector database evaluation. We propose Robustness-δ@K, a new metric that captures the fraction of queries with recall above a threshold δ. This metric offers a deeper view of the recall distribution, helps select vector indexes according to application needs, and guides the optimization of tail performance. We integrate Robustness-δ@K into existing benchmarks and evaluate mainstream vector indexes, revealing significant robustness differences. More robust vector indexes yield better application performance, even at the same average recall. We also identify design factors that influence robustness, providing guidance for improving real-world performance.
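The metric described above is simple to state: the fraction of queries whose recall@K meets or exceeds a threshold δ. A minimal sketch in Python (function names and the ≥ convention are assumptions for illustration, not the paper's code):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the k true nearest neighbors found in the first k results."""
    return len(set(retrieved[:k]) & set(relevant[:k])) / k

def robustness_delta_at_k(per_query_recalls, delta):
    """Robustness-δ@K: fraction of queries with recall@K >= delta."""
    if not per_query_recalls:
        return 0.0
    return sum(r >= delta for r in per_query_recalls) / len(per_query_recalls)

# Two hypothetical indexes with the same mean recall (0.9) but different tails:
index_a = [0.9, 0.9, 0.9, 0.9]   # consistent across queries
index_b = [1.0, 1.0, 1.0, 0.6]   # strong mean, weak tail
print(robustness_delta_at_k(index_a, 0.8))  # 1.0
print(robustness_delta_at_k(index_b, 0.8))  # 0.75
```

The toy example illustrates the paper's central point: average recall cannot distinguish the two indexes, while Robustness-δ@K exposes the weak tail of the second.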
Problem

Research questions and friction points this paper is trying to address.

Current vector database evaluations rely too heavily on average recall.
Average recall masks performance variability on hard queries.
A metric is needed to assess query-level recall consistency (motivating Robustness-δ@K).
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes the Robustness-δ@K metric
Evaluates variability in the recall distribution
Identifies design factors that influence robustness
Zikai Wang
Northeastern University
Qianxi Zhang
MSRA
Baotong Lu
Microsoft Research
Qi Chen
Microsoft Research
Cheng Tan
Northeastern University