🤖 AI Summary
This work addresses the opacity of semantic representations in vision-language models and a limitation of conventional sparse autoencoders: they rely on overcomplete expansions that distort the geometry of the original embedding space and introduce redundancy. The authors propose CEDAR, a post-hoc method that keeps the original embedding dimension and transforms pretrained embeddings into axis-aligned, disentangled representations through an invertible linear transformation, an adaptive rotation mechanism, and a top-k sparsity bottleneck. Without any overcomplete expansion, CEDAR exposes the compositional structure of the embeddings: individual coordinates can be aligned with textual concepts in CLIP-like models and decoded into natural language descriptions in generative models such as BLIP. Experiments show that CEDAR achieves a competitive trade-off between reconstruction fidelity and sparsity, and that its explanations are more interpretable and better aligned with human perception.
📝 Abstract
Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
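To make the idea of a same-dimension change of basis with a top-$k$ bottleneck concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the invertible transformation is an orthogonal matrix (one special case of an invertible map) and uses a random rotation as a stand-in for the learned one. The names `random_rotation`, `top_k_mask`, `encode`, and `decode` are hypothetical and chosen for illustration only.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Stand-in for a learned transformation: a random orthogonal matrix (assumption)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the factorization is unique; q stays orthogonal.
    return q * np.sign(np.diag(r))


def top_k_mask(z, k):
    """Boolean mask selecting the k largest-magnitude coordinates in each row."""
    idx = np.argpartition(np.abs(z), -k, axis=-1)[..., -k:]
    mask = np.zeros_like(z, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask


def encode(x, rotation, k):
    """Change basis without changing dimension, then keep only k coordinates."""
    z = x @ rotation
    return np.where(top_k_mask(z, k), z, 0.0)


def decode(z_sparse, rotation):
    """Invert the change of basis; for an orthogonal matrix the inverse is the transpose."""
    return z_sparse @ rotation.T


dim, k = 8, 3
rotation = random_rotation(dim)
x = np.random.default_rng(1).standard_normal((4, dim))  # toy "embeddings"
z_sparse = encode(x, rotation, k)
x_hat = decode(z_sparse, rotation)
```

With `k` equal to `dim` the bottleneck is inactive and the reconstruction is exact, because the map is invertible. Smaller `k` trades reconstruction fidelity for sparsity, which is the trade-off the abstract refers to. In the sketch the rotation is fixed; in the method described above it is learned.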