🤖 AI Summary
Generative AI has triggered an explosion in data volume, and traditional nonlinear dimensionality reduction methods such as t-SNE and UMAP cannot scale to unstructured embedding sets with millions of entries, severely hindering exploratory data analysis for AI interpretability.
Method: We propose the first scalable visualization framework that supports multi-GPU distributed training. It optimizes an approximate upper bound on the InfoNC-t-SNE loss, derived information-theoretically, and combines negative sampling with mean-affinity discrimination.
Contribution/Results: Our framework achieves end-to-end mapping of the full Multilingual Wikipedia embedding corpus (>10 million entries). Experiments demonstrate substantial improvements over state-of-the-art methods in both speed and visualization quality. Notably, it produces the first global semantic map of multilingual text embeddings at the ten-million scale, establishing a new paradigm for large-scale AI interpretability.
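The InfoNC-t-SNE objective named above can be illustrated with a minimal sketch: a Student-t (Cauchy) affinity scores pairs in the low-dimensional embedding, and an InfoNCE-style contrastive loss pits one positive pair against a set of sampled negatives. The function names, shapes, and toy data below are illustrative assumptions, not the paper's actual NOMAD Projection implementation.

```python
import numpy as np

def cauchy_affinity(y_i, y_j):
    # t-SNE-style low-dimensional affinity: Student-t (Cauchy) kernel,
    # large when points are close, decaying with squared distance.
    return 1.0 / (1.0 + np.sum((y_i - y_j) ** 2, axis=-1))

def infonce_tsne_loss(y_anchor, y_pos, y_negs):
    # InfoNCE-style contrastive loss over affinities: the positive
    # pair competes against m sampled negatives in a softmax ratio.
    pos = cauchy_affinity(y_anchor, y_pos)              # scalar
    negs = cauchy_affinity(y_anchor[None, :], y_negs)   # shape (m,)
    return -np.log(pos / (pos + negs.sum()))

# Toy 2-D embedding: a nearby positive and four distant negatives.
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
negatives = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0], [0.0, -5.0]])

loss = infonce_tsne_loss(anchor, positive, negatives)
```

With a close positive the loss is small; swapping the positive for a distant point drives it up, which is the gradient signal that pulls true neighbors together in the map. NOMAD Projection's contribution is an approximate upper bound on this loss that decomposes across workers, whereas the exact normalizer couples all pairs.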
📝 Abstract
The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, and empirical results that demonstrate NOMAD Projection's superior performance and speed profile compared to existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia.