LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes LUCID, a unified vision-language sparse autoencoder that addresses the limitations of modality-specific training in existing approaches, which often yields features that are hard to interpret and hinders cross-modal alignment. LUCID employs a shared-private representation architecture that learns a common latent dictionary across modalities while preserving modality-specific characteristics. Feature alignment is achieved without supervision through optimal transport, enabling patch-level grounding and cross-modal neuron correspondence. This design substantially mitigates the concept clustering problem in similarity-based evaluation and enhances interpretability. LUCID also introduces an automated dictionary interpretation pipeline based on term clustering. The learned shared features span objects, actions, attributes, and abstract concepts, improving both the interpretability and robustness of multimodal representations.
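A minimal PyTorch sketch of the shared-private architecture described above, assuming image patch and text token features have already been projected to a common width (as in CLIP-style encoders). The names, dimensions, and L1 sparsity penalty are illustrative assumptions, not LUCID's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrivateSAE(nn.Module):
    """One encoder per modality; shared dictionary atoms are reused by both
    modalities, while private atoms stay modality-specific (hypothetical)."""

    def __init__(self, d_model=512, d_shared=4096, d_private=1024):
        super().__init__()
        self.enc = nn.ModuleDict({
            m: nn.Linear(d_model, d_shared + d_private)
            for m in ("image", "text")
        })
        self.dict_shared = nn.Linear(d_shared, d_model, bias=False)
        self.dict_private = nn.ModuleDict({
            m: nn.Linear(d_private, d_model, bias=False)
            for m in ("image", "text")
        })
        self.d_shared = d_shared

    def forward(self, x, modality):
        z = F.relu(self.enc[modality](x))  # non-negative sparse codes
        z_s, z_p = z[..., :self.d_shared], z[..., self.d_shared:]
        x_hat = self.dict_shared(z_s) + self.dict_private[modality](z_p)
        return x_hat, z_s, z_p

def sae_loss(x, x_hat, z_s, z_p, l1=1e-3):
    # Reconstruction plus an L1 penalty that keeps both code groups sparse.
    return F.mse_loss(x_hat, x) + l1 * (z_s.abs().mean() + z_p.abs().mean())
```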

📝 Abstract
Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective, without the need for labels. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging these alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering that requires no manual inspection. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.
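The abstract does not specify the form of the learned optimal transport matching objective. A generic entropic (Sinkhorn) construction over the shared codes, with assumed uniform marginals and a cosine cost, gives a rough picture of how such an alignment loss can be built without labels; every choice below is an assumption, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Entropic OT plan for uniform marginals; cost has shape (n, m)."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)          # Gibbs kernel
    a = torch.full((n,), 1.0 / n)       # uniform marginal over image patches
    b = torch.full((m,), 1.0 / m)       # uniform marginal over text tokens
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):            # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan of shape (n, m)

def ot_alignment_loss(z_img, z_txt):
    """z_img: (n_patches, d_shared); z_txt: (n_tokens, d_shared)."""
    zi = F.normalize(z_img, dim=-1)
    zt = F.normalize(z_txt, dim=-1)
    cost = 1.0 - zi @ zt.T              # cosine distance between shared codes
    with torch.no_grad():
        plan = sinkhorn_plan(cost)      # plan treated as a fixed matching
    return (plan * cost).sum()          # pull matched codes together
```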
Problem

Research questions and friction points this paper is trying to address.

sparse autoencoders
interpretable concept discovery
vision-language alignment
cross-modal representation
shared latent dictionary
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoder
vision-language alignment
interpretable representation
optimal transport
cross-modal concept discovery
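Both the summary and the abstract describe an automated dictionary interpretation pipeline based on term clustering. A minimal sketch of that idea follows, with hypothetical inputs: top-activating terms per shared feature and stand-in term embeddings (a real pipeline would use text-encoder embeddings).

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def interpret_feature(terms, term_vecs, n_clusters=3):
    """Label one dictionary feature from its top-activating terms.

    terms:     top-activating text tokens/phrases for the feature
    term_vecs: (len(terms), d) embeddings of those terms
    """
    k = min(n_clusters, len(terms))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(term_vecs)
    # The dominant cluster is taken as the candidate concept; its member
    # terms serve as the feature's human-readable description.
    dominant = Counter(labels).most_common(1)[0][0]
    return [t for t, lbl in zip(terms, labels) if lbl == dominant]

# Hypothetical usage with random stand-in embeddings.
terms = ["dog", "puppy", "canine", "leash", "park"]
vecs = np.random.rand(len(terms), 64)
print(interpret_feature(terms, vecs))
```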