AI Summary
To address the uncontrolled topology of autoencoder (AE) latent spaces (LS), this paper proposes a geometric-loss-guided co-optimization framework for supervised AEs, the first to explicitly configure LS topology in this setting. The method jointly optimizes the encoder architecture and geometric constraint losses, enabling user-defined cluster positions and shapes, decoder-free label prediction, and cross-sample similarity assessment. Key contributions include zero-shot cross-dataset generalization, similarity estimation for unseen classes, and text-driven image retrieval without classifiers or language models. Experiments demonstrate 12–19% improvements in zero-shot accuracy on LIP, Market-1501, and WildTrack, and 78.3% mAP in cross-modal retrieval.
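The decoder-free prediction mentioned above follows from the LS configuration being known in advance: if each class has a user-defined cluster position, a sample's label is simply the nearest configured center, and similarity between samples can be read off as latent distance. A minimal sketch of this idea, assuming hypothetical 2-D centers and Euclidean distance (the paper's exact geometry may differ):

```python
import numpy as np

# Hypothetical pre-configured cluster centers in a 2-D latent space;
# the method lets the user place one center per class before training.
CENTERS = np.array([
    [0.0, 0.0],   # class 0
    [4.0, 0.0],   # class 1
    [0.0, 4.0],   # class 2
])

def predict_label(z: np.ndarray) -> int:
    """Decoder-free prediction: the nearest configured center wins."""
    dists = np.linalg.norm(CENTERS - z, axis=1)
    return int(np.argmin(dists))

def similarity(z_a: np.ndarray, z_b: np.ndarray) -> float:
    """Cross-sample similarity directly in the LS (negative distance)."""
    return float(-np.linalg.norm(z_a - z_b))
```

Because the centers are fixed a priori rather than learned, the same rule applies unchanged to embeddings of data from other datasets, which is what enables the zero-shot cross-dataset evaluation.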
Abstract
Autoencoders (AEs) are a simple yet powerful class of neural networks that compress data by projecting inputs into a low-dimensional latent space (LS). Although the LS is shaped by loss minimization during training, its properties and topology are not controlled directly. In this paper we focus on the properties of the AE LS and propose two methods for obtaining an LS with a desired topology, which we call LS configuration. The proposed methods are loss configuration, using a geometric loss term that acts directly in the LS, and encoder configuration. We show that the former reliably yields an LS with the desired configuration by defining the positions and shapes of LS clusters for a supervised AE (SAE). Knowing the LS configuration allows us to define a similarity measure in the LS that predicts labels or estimates similarity for multiple inputs without decoders or classifiers. We also show that this leads to more stable and interpretable training. An SAE trained for clothing-texture classification with the proposed method generalizes well, without fine-tuning, to unseen data from the LIP, Market-1501, and WildTrack datasets, and even allows similarity evaluation for unseen classes. We further illustrate the advantages of pre-configured LS similarity estimation with cross-dataset searches and text-based search using a text query, without language models.
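The geometric loss term described in the abstract acts directly on latent codes rather than on reconstructions. A plausible minimal form, sketched here under the assumption that it penalizes each code's distance to its class's predefined center (the paper's exact term, weighting `lam`, and `margin` parameter are illustrative, not the authors' definitions):

```python
import numpy as np

def geometric_loss(z, labels, centers, margin=0.0):
    """Pull each latent code toward its class's predefined center.

    z: (N, d) latent codes; labels: (N,) integer class ids;
    centers: (K, d) user-defined cluster positions in the LS.
    Hypothetical form: squared hinge on the distance to the own center.
    """
    target = centers[labels]                    # (N, d) per-sample targets
    d = np.linalg.norm(z - target, axis=1)      # distance to own center
    return float(np.mean(np.maximum(d - margin, 0.0) ** 2))

def total_loss(x, x_rec, z, labels, centers, lam=1.0):
    """Reconstruction MSE plus the geometric LS term."""
    rec = float(np.mean((x - x_rec) ** 2))
    return rec + lam * geometric_loss(z, labels, centers)
```

In training, such a term would be added to the usual reconstruction objective, so the encoder is driven both to preserve input information and to place each sample inside its user-specified cluster.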