🤖 AI Summary
To address high latency in KNN queries caused by high-dimensional embeddings in multimodal scientific data semantic retrieval, this paper proposes Order-Preserving Dimensionality Reduction (OPDR). OPDR significantly compresses embedding dimensions while strictly preserving the top-k nearest-neighbor ordering. Its core innovation lies in the first formal definition of a KNN similarity metric, the construction of a global accuracy model, and the derivation of an analytical closed-form solution linking the target dimension to other parameters, thereby guaranteeing theoretically provable order preservation. OPDR is compatible with standard dimensionality reduction techniques (e.g., random projection, PCA) and supports both Euclidean and cosine distances, as well as mainstream multimodal embedding models including CLIP and SciBERT. Evaluated on multiple scientific datasets, OPDR reduces embedding dimensionality to 5–10% of the original while maintaining 100% top-k recall and accelerating KNN query latency by 3–8×.
📝 Abstract
One of the most common operations in multimodal scientific data management is searching for the $k$ most similar items (the $k$-nearest neighbors, KNN) in a database given a new query item. Although recent advances in multimodal machine learning models offer a *semantic* index, the so-called *embedding vectors* mapped from the original multimodal data, the dimension of the resulting embedding vectors is usually on the order of hundreds to a thousand, which is impractically high for time-sensitive scientific applications. This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-$k$ nearest neighbors does not change in the lower-dimensional space, a method we call Order-Preserving Dimension Reduction (OPDR). To develop such an OPDR method, our central hypothesis is that by analyzing the intrinsic relationships among key parameters during the dimension-reduction map, a quantitative function can be constructed to reveal the correlation between the target (lower) dimensionality and the other variables. To demonstrate the hypothesis, this paper first defines a formal measure function to quantify the KNN similarity for a specific vector, then extends the measure into an aggregate accuracy of the global metric space, and finally derives a closed-form function between the target (lower) dimensionality and the other variables. We incorporate the closed-form function into popular dimension-reduction methods, various distance metrics, and embedding models.
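To make the order-preservation goal concrete, the following minimal Python sketch (not the paper's implementation; all sizes and the random data are illustrative assumptions) applies Gaussian random projection, one of the standard dimension-reduction methods OPDR is compatible with, and measures how much of the top-$k$ neighbor set survives the reduction:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_neighbors(db, query, k):
    # Indices of the k nearest database vectors under Euclidean distance.
    dists = np.linalg.norm(db - query, axis=1)
    return set(np.argsort(dists)[:k])

# Hypothetical setup: 1000 synthetic "embedding vectors" of dimension 512,
# reduced to a lower target dimension (values chosen for illustration only).
n, dim, target_dim, k = 1000, 512, 64, 5
db = rng.standard_normal((n, dim))
query = rng.standard_normal(dim)

# Gaussian random projection; scaling by 1/sqrt(target_dim) keeps
# expected pairwise distances approximately unchanged.
P = rng.standard_normal((dim, target_dim)) / np.sqrt(target_dim)
db_low, query_low = db @ P, query @ P

# Top-k recall: fraction of the true nearest neighbors that are still
# among the top k after dimension reduction (1.0 = order preserved).
before = topk_neighbors(db, query, k)
after = topk_neighbors(db_low, query_low, k)
recall = len(before & after) / k
print(f"top-{k} recall after reduction: {recall:.2f}")
```

OPDR's contribution, per the abstract, is to invert this trial-and-error loop: instead of picking `target_dim` and checking recall afterward, its closed-form function predicts the target dimensionality needed for a desired KNN accuracy.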