🤖 AI Summary
Current vector database evaluation over-relies on mean recall, which obscures performance variance across queries and hides poor results on hard queries, compromising reliability in downstream tasks such as RAG. This work advocates shifting from an “average performance” to a “distributional robustness” evaluation paradigm. We propose Robustness-δ@K, a metric that reports the fraction of queries whose recall exceeds a threshold δ, exposing the tail of the recall distribution that the mean conceals. We integrate it into mainstream benchmarks (BEIR, MTEB) to re-evaluate prominent indexes, including HNSW, IVF, and DiskANN. Experiments show that, at identical mean recall, indexes with higher Robustness-δ@K yield up to 12.3% improvement in RAG answer accuracy. Moreover, we identify graph connectivity and cluster balance as critical architectural factors governing retrieval robustness.
📝 Abstract
Vector databases are critical infrastructure in AI systems, and average recall is the dominant metric for their evaluation. Both users and researchers rely on it to choose and optimize their systems. We show that relying on average recall is problematic: it hides variability across queries, allowing systems with strong mean performance to underperform significantly on hard queries. These tail cases confuse users and can lead to failure in downstream applications such as RAG. We argue that robustness, i.e., consistently achieving acceptable recall across queries, is crucial to vector database evaluation. We propose Robustness-$δ$@K, a new metric that captures the fraction of queries with recall above a threshold $δ$. This metric offers a deeper view of the recall distribution, helps select a vector index that fits application needs, and guides the optimization of tail performance. We integrate Robustness-$δ$@K into existing benchmarks and evaluate mainstream vector indexes, revealing significant robustness differences. More robust vector indexes yield better application performance, even at the same average recall. We also identify design factors that influence robustness, providing guidance for improving real-world performance.
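To make the metric concrete, the following is a minimal sketch of how Robustness-$δ$@K could be computed from per-query recall values. It is an illustration, not the authors' implementation: the function names `recall_at_k` and `robustness_delta_at_k` are hypothetical, the threshold comparison is assumed to be inclusive (recall at least $δ$), and the two recall arrays are made-up data chosen so that both indexes share the same mean recall.

```python
import numpy as np


def recall_at_k(retrieved, ground_truth, k):
    """Per-query recall@K: share of the true top-k neighbors found in the retrieved top-k.

    Both inputs are 2D arrays of neighbor ids, one row per query (hypothetical layout).
    """
    retrieved = np.asarray(retrieved)[:, :k]
    ground_truth = np.asarray(ground_truth)[:, :k]
    hits = [len(set(r) & set(g)) for r, g in zip(retrieved, ground_truth)]
    return np.array(hits) / k


def robustness_delta_at_k(recalls, delta):
    """Fraction of queries whose recall@K is at least delta (inclusive bound is an assumption)."""
    return float(np.mean(np.asarray(recalls) >= delta))


# Made-up per-query recall@K for two hypothetical indexes.
index_a = np.array([0.9] * 10)         # uniformly good on every query
index_b = np.array([1.0] * 9 + [0.0])  # perfect on most queries, fails on one

for name, recalls in [("A", index_a), ("B", index_b)]:
    print(name, "mean recall:", recalls.mean(),
          "robustness-0.5:", robustness_delta_at_k(recalls, 0.5))
```

Both arrays average to the same recall, yet only index A keeps every query above the 0.5 threshold. The choice of $δ$ matters: with a stricter threshold such as 0.95 the ranking of these two toy indexes flips, which is why the threshold should follow the application's needs.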