DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

📅 2025-10-07
🤖 AI Summary
This work addresses unsupervised sequential disentanglement: the separation of static (content) and dynamic (motion) factors in time-series, video, and audio data without labeled supervision. Existing VAE- or GAN-based approaches suffer from cumbersome loss designs, poor generalization, and a lack of realistic evaluation protocols. To overcome these limitations, we propose DiffSDA, the first diffusion-based theoretical framework for sequential disentanglement. DiffSDA introduces a latent-variable diffusion process, a novel probabilistic modeling formulation, and an efficient sampling mechanism tailored for disentangled representation learning. Furthermore, we establish a unified cross-modal evaluation protocol grounded in real-world scenarios. Extensive experiments on multiple real-world benchmarks demonstrate that DiffSDA outperforms state-of-the-art methods in disentangling static and dynamic factors in complex temporal data.

📝 Abstract
Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modality-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA leverages a new probabilistic modeling approach, latent diffusion, and efficient samplers, while incorporating a challenging evaluation protocol for rigorous testing. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.
Problem

Research questions and friction points this paper is trying to address.

Separates static and dynamic factors in unlabeled sequential data
Addresses limitations of existing VAE and GAN-based disentanglement methods
Establishes an evaluation protocol for real-world sequential data across modalities (time series, video, audio)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages latent diffusion for sequential disentanglement
Uses a new probabilistic modeling formulation and efficient samplers
Introduces a modality-agnostic framework for time series, video, and audio