🤖 AI Summary
This study investigates how much information the latent features of visual encoders preserve and how this depends on the pretraining objective. We propose image reconstruction fidelity as a unified quantitative metric, systematically evaluating feature richness across reconstruction-based (e.g., MAE) and contrastive learning-based (e.g., SigLIP/SigLIP2) encoders via differentiable gradient-based reconstruction, feature perturbation analysis, and encoder-agnostic inverse mapping modeling. Our analysis reveals, for the first time, that orthogonal rotations predominantly govern color representation, elucidating the geometric structure of the feature space and its mapping to semantic attributes. We empirically demonstrate that reconstruction-pretrained encoders significantly outperform contrastive ones in information retention. Furthermore, we establish the first cross-architecture benchmark for quantifying and ranking the informational capacity of visual encoder features. Finally, our framework enables disentangled and controllable reconstruction of semantic attributes, such as color, supporting interpretable and editable vision representations.
📝 Abstract
Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available on GitHub.
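The two core ideas above — inverting a frozen encoder by gradient descent on its features, and probing the feature space with orthogonal rotations — can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration only: the "encoder" is a random linear map standing in for a frozen vision encoder (the paper's encoders are deep networks inverted with autodiff), and the dimensions, step counts, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen vision encoder: a fixed linear map W.
# (A real setup would invert a deep network via autodiff, e.g. PyTorch.)
d_in, d_feat = 16, 32
W = rng.standard_normal((d_feat, d_in)) / np.sqrt(d_in)

def encode(x):
    return W @ x

def reconstruct(f, steps=2000, lr=0.1):
    """Recover an input x_hat whose features match f by gradient descent
    on 0.5 * ||W x - f||^2 (gradient written out by hand for the toy map)."""
    x = np.zeros(d_in)
    for _ in range(steps):
        grad = W.T @ (W @ x - f)  # gradient of 0.5 * ||W x - f||^2 w.r.t. x
        x -= lr * grad
    return x

# Gradient-based reconstruction: with an information-preserving encoder,
# the input is recovered almost exactly from its features.
x_true = rng.standard_normal(d_in)
f = encode(x_true)
x_hat = reconstruct(f)
err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative reconstruction error: {err:.2e}")

# Feature perturbation: apply a random orthogonal rotation Q to the
# features and reconstruct again. In the paper's finding, such rotations
# of real encoder features predominantly change color in the output image.
Q, _ = np.linalg.qr(rng.standard_normal((d_feat, d_feat)))
x_rot = reconstruct(Q @ f)
print(f"input shift after rotating features: {np.linalg.norm(x_rot - x_true):.2f}")
```

In this linear toy the rotation produces an arbitrary shift of the reconstruction; the paper's empirical claim is that for real encoders this degree of freedom aligns with a specific semantic attribute (color), which is what makes the edit controllable.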