🤖 AI Summary
Current unsupervised representation disentanglement faces three key bottlenecks: (1) reliance on synthetic or annotated data, limiting generalization to real-world scenarios; (2) hand-crafted constraints that hinder adaptive optimization; and (3) absence of evaluation metrics suitable for unlabeled real-world data. This paper proposes the first closed-loop disentanglement framework, integrating a diffusion autoencoder with β-VAE to achieve adaptive, interpretable semantic factor disentanglement via latent distillation and diffusion-based feedback. Our contributions include: (1) introducing the first closed-loop learning paradigm for disentanglement; (2) proposing a novel, content-tracking-based evaluation metric for unsupervised disentanglement on unlabeled data; and (3) designing a self-supervised navigation strategy to identify interpretable semantic directions in latent space. Extensive experiments on real-image editing and visual analysis tasks demonstrate significant improvements over state-of-the-art methods, validating both the generalizability and practical utility of unsupervised disentanglement in natural scenes.
📝 Abstract
Representation disentanglement may help AI fundamentally understand the real world and thus benefit both discrimination and generation tasks. It currently faces at least three unresolved core issues: (i) heavy reliance on label annotation and synthetic data, causing poor generalization to natural scenarios; (ii) heuristic, hand-crafted disentangling constraints that make it hard to adaptively reach an optimal training trade-off; and (iii) the lack of a reasonable evaluation metric, especially for real, label-free data. To address these challenges, we propose a **C**losed-**L**oop unsupervised representation **Dis**entanglement approach dubbed **CL-Dis**. Specifically, we use a diffusion-based autoencoder (Diff-AE) as the backbone while resorting to β-VAE as a co-pilot to extract semantically disentangled representations. The strong generation ability of the diffusion model and the good disentanglement ability of the VAE are complementary. To strengthen disentanglement, VAE-latent distillation and diffusion-wise feedback are interconnected in a closed-loop system for further mutual promotion. A self-supervised **Navigation** strategy is then introduced to identify interpretable semantic directions in the disentangled latent space. Finally, a new metric based on content tracking is designed to evaluate the disentanglement effect. Experiments demonstrate the superiority of CL-Dis in applications such as real image manipulation and visual analysis.
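For readers unfamiliar with the β-VAE co-pilot mentioned above: its training objective is the standard VAE evidence lower bound with the KL term up-weighted by a factor β > 1, which encourages disentangled latent factors. The following is a minimal illustrative sketch of that objective in NumPy (not the authors' implementation; the function name and the choice of MSE reconstruction are assumptions for demonstration):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Illustrative beta-VAE objective: reconstruction error plus a
    beta-weighted KL divergence between the approximate posterior
    N(mu, diag(exp(log_var))) and the standard normal prior N(0, I).

    x, x_recon : (batch, dim) inputs and reconstructions
    mu, log_var: (batch, latent_dim) Gaussian posterior parameters
    """
    # Per-sample squared reconstruction error, averaged over the batch.
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=-1))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian,
    # summed over latent dimensions, averaged over the batch.
    kl = np.mean(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1))
    return recon + beta * kl
```

With β = 1 this reduces to the ordinary VAE ELBO (up to sign); larger β trades reconstruction fidelity for latent-factor independence, which is the trade-off CL-Dis aims to balance adaptively via its closed-loop feedback rather than by hand-tuning β.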