🤖 AI Summary
Current unsupervised representation disentanglement faces three key bottlenecks: (1) reliance on synthetic or annotated data, limiting generalization to real-world scenarios; (2) hand-crafted constraints that hinder adaptive optimization; and (3) absence of evaluation metrics suitable for unlabeled real-world data. This paper proposes the first closed-loop disentanglement framework, integrating a diffusion autoencoder with β-VAE to achieve adaptive, interpretable semantic factor disentanglement via latent distillation and diffusion-based feedback. Our contributions include: (1) introducing the first closed-loop learning paradigm for disentanglement; (2) proposing a novel, content-tracking-based evaluation metric for unsupervised disentanglement on unlabeled data; and (3) designing a self-supervised navigation strategy to identify interpretable semantic directions in latent space. Extensive experiments on real-image editing and visual analysis tasks demonstrate significant improvements over state-of-the-art methods, validating both the generalizability and practical utility of unsupervised disentanglement in natural scenes.
📝 Abstract
Representation disentanglement may help AI fundamentally understand the real world and thus benefit both discrimination and generation tasks. It currently faces at least three unresolved core issues: (i) heavy reliance on label annotation and synthetic data, causing poor generalization to natural scenarios; (ii) heuristic, hand-crafted disentangling constraints that make it hard to adaptively reach an optimal training trade-off; and (iii) the lack of a reasonable evaluation metric, especially for real, label-free data. To address these challenges, we propose a **C**losed-**L**oop unsupervised representation **Dis**entanglement approach dubbed **CL-Dis**. Specifically, we use a diffusion-based autoencoder (Diff-AE) as the backbone while resorting to β-VAE as a co-pilot to extract semantically disentangled representations. The strong generation ability of the diffusion model and the good disentanglement ability of the VAE are complementary. To strengthen disentanglement, VAE-latent distillation and diffusion-wise feedback are interconnected in a closed-loop system for further mutual promotion. A self-supervised **Navigation** strategy is then introduced to identify interpretable semantic directions in the disentangled latent space. Finally, a new metric based on content tracking is designed to evaluate the disentanglement effect. Experiments demonstrate the superiority of CL-Dis in applications such as real image manipulation and visual analysis.
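For readers unfamiliar with the β-VAE co-pilot mentioned above: its training objective is the standard VAE evidence lower bound with the KL term up-weighted by a factor β > 1, which encourages disentangled latent factors. The following is a minimal illustrative sketch of that objective in NumPy (not the authors' implementation; the function name and the choice of MSE reconstruction are assumptions for demonstration):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Illustrative beta-VAE objective: reconstruction error plus a
    beta-weighted KL divergence between the approximate posterior
    N(mu, diag(exp(log_var))) and the standard normal prior N(0, I).

    x, x_recon : (batch, dim) inputs and reconstructions
    mu, log_var: (batch, latent_dim) Gaussian posterior parameters
    """
    # Per-sample squared reconstruction error, averaged over the batch.
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=-1))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian,
    # summed over latent dimensions, averaged over the batch.
    kl = np.mean(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1))
    return recon + beta * kl
```

With β = 1 this reduces to the ordinary VAE ELBO (up to sign); larger β trades reconstruction fidelity for latent-factor independence, which is the trade-off CL-Dis aims to balance adaptively via its closed-loop feedback rather than by hand-tuning β.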