🤖 AI Summary
This study investigates the invertibility of intermediate representations in Transformer-based vision models, specifically ViT and Deformable DETR, to uncover mechanistic differences in shape modeling, detail preservation, inter-layer correlation, and color robustness. We propose a unified modular inverse modeling framework that reconstructs images efficiently from multi-layer features via feature-space projection and lightweight inverse networks; it is the first such approach enabling cross-layer reconstruction. Quantitative evaluation (PSNR/SSIM), visual analysis, and controlled color perturbation experiments reveal that (1) shallow layers better preserve textural details, while deeper layers encode semantic shape; and (2) ViT exhibits stronger inter-layer feature correlation, whereas Deformable DETR demonstrates superior robustness to color variations. Our work provides a novel invertibility-centric perspective and systematic empirical evidence for understanding how Transformer vision models structure visual representations.
📝 Abstract
Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many previous approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we apply a modular approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer and a Vision Transformer, showing that this approach is efficient and feasible. Through qualitative and quantitative evaluations of reconstructed images, we generate insights into the underlying mechanisms of these architectures, highlighting their similarities and differences in terms of contextual shape and preservation of image details, inter-layer correlation, and robustness to color perturbations. Our analysis illustrates how these properties emerge within the models, contributing to a deeper understanding of transformer-based vision models. The code for reproducing our experiments is available at github.com/wiskott-lab/inverse-detection-transformer.
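As a rough illustration of the quantitative side of the evaluation, the sketch below computes PSNR, one of the reconstruction metrics named in the summary, between an input image and its reconstruction. It is a minimal sketch and not the authors' evaluation code: the function names `mse` and `psnr`, the flat list-of-floats image representation, and the assumed pixel range of [0, 1] are all illustrative assumptions.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE).

    max_value is the assumed peak pixel value (1.0 for images scaled to [0, 1]).
    Returns infinity when the two inputs are identical.
    """
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy data (hypothetical): four pixels, each reconstructed with an error of 0.1.
original = [0.0, 0.5, 1.0, 0.25]
reconstructed = [0.1, 0.4, 0.9, 0.35]
print(f"PSNR: {psnr(original, reconstructed):.2f} dB")
```

With every pixel off by 0.1, the MSE is about 0.01, so the toy example yields roughly 20 dB. Higher values indicate a reconstruction closer to the input, which is how such a metric could be used to compare reconstructions across layers.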