🤖 AI Summary
Concept Activation Vectors (CAVs) often yield non-orthogonal, entangled directions for semantically related concepts (e.g., "beard" and "necktie" in CelebA), impairing interpretability and undermining causal interventions. To address this, we propose a post-hoc concept disentanglement framework. Our key contribution is a non-orthogonality loss that enforces orthogonality among CAVs while preserving the semantic fidelity of the original concept directions. Combined with post-hoc CAV optimization, our method isolates concepts within VGG16 and ResNet18 feature spaces. Experiments on CelebA and FunnyBirds show that generative models can inject individual concepts in isolation; in shortcut suppression tasks, interference from confounding concepts decreases by 42% relative to baseline CAVs, significantly improving both accuracy and independence in concept manipulation.
📄 Abstract
Concept Activation Vectors (CAVs) are widely used to model human-understandable concepts as directions within the latent space of neural networks. They are trained by identifying directions from the activations of concept samples to those of non-concept samples. However, this method often produces similar, non-orthogonal directions for correlated concepts, such as "beard" and "necktie" within the CelebA dataset, which frequently co-occur in images of men. This entanglement complicates the interpretation of concepts in isolation and can lead to undesired effects in CAV applications, such as activation steering. To address this issue, we introduce a post-hoc concept disentanglement method that employs a non-orthogonality loss, facilitating the identification of orthogonal concept directions while preserving directional correctness. We evaluate our approach with real-world and controlled correlated concepts in CelebA and a synthetic FunnyBirds dataset with VGG16 and ResNet18 architectures. We further demonstrate the superiority of orthogonalized concept representations in activation steering tasks, allowing (1) the insertion of isolated concepts into input images through generative models and (2) the removal of concepts for effective shortcut suppression with reduced impact on correlated concepts in comparison to baseline CAVs.
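To make the idea concrete, below is a minimal NumPy sketch of joint CAV fitting with a pairwise non-orthogonality penalty. This is not the authors' implementation: the logistic-regression fit, the specific penalty (squared cosine similarity between CAV pairs), the gradient approximation, and the weight `lam` are all illustrative assumptions.

```python
import numpy as np

def train_cavs(acts, labels, n_concepts, lr=0.1, lam=1.0, steps=500, seed=0):
    """Fit one linear direction (CAV) per concept while penalizing
    pairwise non-orthogonality between the directions.

    acts:   (n_samples, d) activation matrix from some network layer
    labels: (n_samples, n_concepts) binary concept labels
    Returns unit-norm CAVs of shape (n_concepts, d).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(n_concepts, acts.shape[1]))
    for _ in range(steps):
        # Directional-correctness term: logistic regression per concept,
        # separating concept from non-concept activations.
        logits = acts @ W.T                       # (n, k)
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad_fit = (probs - labels).T @ acts / len(acts)

        # Non-orthogonality penalty: 0.5 * sum of squared off-diagonal
        # cosine similarities between normalized CAVs. The gradient below
        # ignores the normalization Jacobian (a simplifying assumption).
        Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
        S = Wn @ Wn.T                             # pairwise cosine similarities
        np.fill_diagonal(S, 0.0)
        grad_orth = S @ Wn

        W -= lr * (grad_fit + lam * grad_orth)
    return W / np.linalg.norm(W, axis=1, keepdims=True)
```

With `lam=0` this reduces to ordinary per-concept linear probes; increasing `lam` trades a small amount of separating accuracy for near-orthogonal concept directions, mirroring the trade-off the abstract describes between orthogonality and directional correctness.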