DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

📅 2025-10-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unsupervised sequential disentanglement: separating static (content) and dynamic (motion) factors in time-series, video, and audio data without labels. Existing VAE- and GAN-based approaches rely on multiple loss terms that complicate optimization, struggle on real-world data, and lack an established evaluation protocol for such settings. To address these limitations, the authors propose DiffSDA, which they present as the first theoretical formalization of diffusion models for sequential disentanglement. DiffSDA combines a new probabilistic modeling formulation, latent diffusion, and efficient samplers. The authors also introduce a challenging evaluation protocol for real-world data across modalities. Experiments on multiple real-world benchmarks show that DiffSDA outperforms recent state-of-the-art methods at disentangling static and dynamic factors.

📝 Abstract

Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modal-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA leverages a new probabilistic modeling approach, latent diffusion, and efficient samplers, and it is evaluated under a challenging new protocol for rigorous testing. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.

Problem

Research questions and friction points this paper is trying to address.

Separates static and dynamic factors in unlabeled sequential data

Addresses limitations of existing VAE and GAN-based disentanglement methods

Establishes evaluation protocol for real-world multimodal sequential data
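To make the static/dynamic split concrete, here is a minimal toy sketch. It is not the DiffSDA method: it assumes a hypothetical additive rule (each frame equals a fixed static vector plus a zero-mean, time-varying dynamic vector) and uses a naive time-average split, purely to illustrate what separating and recombining the two kinds of factors means. All function names are illustrative.

```python
import numpy as np


def make_sequence(static, dynamic):
    """Toy generative rule (an assumption): x_t = static + dynamic_t."""
    return static[None, :] + dynamic


def split_static_dynamic(x):
    """Naive split: the time average is the static part, the residual is dynamic."""
    static_hat = x.mean(axis=0)
    dynamic_hat = x - static_hat[None, :]
    return static_hat, dynamic_hat


rng = np.random.default_rng(0)
T, D = 16, 4  # sequence length and feature dimension (arbitrary toy values)

# Two sequences with different static content and different dynamics.
static_a, static_b = rng.normal(size=D), rng.normal(size=D)
dynamic_a = rng.normal(size=(T, D))
dynamic_b = rng.normal(size=(T, D))
# Center the dynamics over time so the naive split is exact in this toy setting.
dynamic_a -= dynamic_a.mean(axis=0)
dynamic_b -= dynamic_b.mean(axis=0)

x_a = make_sequence(static_a, dynamic_a)
x_b = make_sequence(static_b, dynamic_b)

s_a, d_a = split_static_dynamic(x_a)
s_b, d_b = split_static_dynamic(x_b)

# Recombine: static content of sequence B with the dynamics of sequence A.
swapped = make_sequence(s_b, d_a)
```

In this toy setting the split recovers the factors exactly, so `swapped` matches a sequence built directly from `static_b` and `dynamic_a`. Real video, audio, and time-series data do not follow such an additive rule, which is why learned models are needed.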

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages latent diffusion for sequential disentanglement

Uses a new probabilistic modeling formulation and efficient samplers

Introduces a modal-agnostic framework spanning time series, video, and audio

🔎 Similar Papers

Completed Feature Disentanglement Learning for Multimodal MRIs Analysis