Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of learning disentangled representations from real-world sequential data characterized by complex, coupled factors, this paper introduces the first benchmark platform encompassing six diverse datasets across video, audio, and time-series modalities. Methodologically: (1) it employs Koopman operator theory to model temporal dependencies, transcending the conventional static-dynamic dichotomy; (2) it incorporates a latent-variable exploration phase enabling automatic semantic factor alignment; and (3) it integrates vision-language models to support zero-shot automatic annotation and unsupervised evaluation. Contributions include: the first multimodal, multi-factor disentanglement benchmark; a scalable, modular modeling paradigm; and an end-to-end disentangled learning and evaluation pipeline requiring no manual annotations. Experiments demonstrate state-of-the-art performance in multi-factor disentanglement quality, cross-modal generalization, and interpretability.
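The Koopman-operator component mentioned above can be illustrated with a minimal sketch: Koopman-style models assume that latent states evolve (approximately) linearly, z_{t+1} ≈ K z_t, so a finite-dimensional operator K can be fit by least squares over consecutive latent pairs. The toy data, variable names, and diagonal ground-truth dynamics below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical latent sequence: T timesteps, d latent dimensions.
rng = np.random.default_rng(0)
T, d = 100, 4
K_true = np.diag([0.9, 0.8, 0.7, 0.6])  # assumed linear latent dynamics

Z = np.zeros((T, d))
Z[0] = rng.normal(size=d)
for t in range(T - 1):
    Z[t + 1] = K_true @ Z[t]  # evolve latents with the true operator

# Koopman-style fit: solve min_K ||Z[1:] - Z[:-1] K^T|| by least squares.
X, Y = Z[:-1], Z[1:]
A, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves X @ A ≈ Y
K_hat = A.T                                # so K_hat @ z_t ≈ z_{t+1}

print(np.allclose(K_hat, K_true, atol=1e-6))
```

In a sequential-disentanglement setting, the appeal of such an operator is interpretability: its eigenstructure separates slow and fast modes of the latent dynamics, which is one motivation for moving beyond a fixed static/dynamic split.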

📝 Abstract
Learning disentangled representations in sequential data is a key goal in deep learning, with broad applications in vision, audio, and time series. While real-world data involves multiple interacting semantic factors over time, prior work has mostly focused on simpler two-factor static and dynamic settings, primarily because such settings make data collection easier, thereby overlooking the inherently multi-factor nature of real-world data. We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets spanning video, audio, and time series. Our benchmark includes modular tools for dataset integration, model development, and evaluation metrics tailored to multi-factor analysis. We additionally propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Moreover, we show that Vision-Language Models can automate dataset annotation and serve as zero-shot disentanglement evaluators, removing the need for manual labels and human intervention. Together, these contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement.
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-factor sequential disentanglement in real-world data
Overcoming limitations of static and dynamic two-factor approaches
Providing standardized benchmarks and tools for evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced first multi-factor sequential disentanglement benchmark
Proposed Koopman-inspired model achieving state-of-the-art performance
Used Vision-Language Models for automated annotation and evaluation
Tal Barami
Faculty of Computer and Information Science, Ben-Gurion University of the Negev
Nimrod Berman
Ben-Gurion University of the Negev
Deep Learning
Ilan Naiman
PhD Student, Computer Science, Ben-Gurion University of the Negev
Deep Learning
Amos H. Hason
Faculty of Computer and Information Science, Ben-Gurion University of the Negev
Rotem Ezra
Faculty of Computer and Information Science, Ben-Gurion University of the Negev
Omri Azencot
Senior Lecturer (Assistant Professor) of Computer Science, Ben-Gurion University of the Negev
Machine Learning, Representation Learning, Generative Modeling, Sequential Modeling