🤖 AI Summary
This work addresses the problem of automatically learning and disentangling group actions in the latent spaces of high-dimensional image data. We propose an end-to-end framework that dynamically separates transformation-sensitive from transformation-invariant features via a learnable binary mask, eliminating the need to manually specify equivariant and invariant subspaces, and that jointly optimizes the mask and the representation using straight-through estimation. Crucially, our method is the first to directly learn the structure of group actions on the latent manifold, unifying representation disentanglement and group transformation mapping within a standard encoder-decoder architecture. Experiments on five 2D and 3D image datasets show that the approach automatically discovers group-action-aware disentangled representations and yields significant improvements in downstream classification accuracy, validating its effectiveness and generality for controllable generation and interpretable representation learning.
📝 Abstract
Modeling group actions on latent representations enables controllable transformations of high-dimensional image data. Prior works that apply group-theoretic priors or model transformations typically operate in the high-dimensional data space, where group actions act uniformly on the entire input, making it difficult to disentangle the subspace that varies under transformations. While latent-space methods offer greater flexibility, they still require manually partitioning the latent variables into equivariant and invariant subspaces, limiting the ability to robustly learn and apply group actions within the representation space. To address this, we introduce a novel end-to-end framework that, for the first time, learns group actions on latent image manifolds, automatically discovering transformation-relevant structure without manual intervention. Our method uses learnable binary masks with straight-through estimation to dynamically partition latent representations into transformation-sensitive and transformation-invariant components. We formulate this within a unified optimization framework that jointly learns latent disentanglement and group transformation mappings, and the framework can be seamlessly integrated with any standard encoder-decoder architecture. We validate our approach on five 2D/3D image datasets, demonstrating its ability to automatically learn disentangled latent factors for group actions across diverse data, while downstream classification tasks confirm the effectiveness of the learned representations. Our code is publicly available at https://github.com/farhanaswarnali/Learning-Group-Actions-In-Disentangled-Latent-Image-Representations.
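The binary-mask partitioning with straight-through estimation described above can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: it posits a per-dimension logit vector whose sigmoid is thresholded to a hard 0/1 mask in the forward pass, while the backward pass treats the threshold as the identity (the straight-through estimator), so the mask parameters remain trainable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ste_mask_forward(logits):
    """Forward pass: threshold sigmoid probabilities to a hard 0/1 mask."""
    probs = sigmoid(logits)
    hard = (probs > 0.5).astype(np.float64)
    return hard, probs

def ste_mask_backward(grad_out, probs):
    """Straight-through backward pass: pretend the threshold was the
    identity, so the gradient w.r.t. the logits is grad_out * sigmoid'."""
    return grad_out * probs * (1.0 - probs)

# Hypothetical usage: partition a latent code z with the learned mask.
rng = np.random.default_rng(0)
z = rng.standard_normal(8)        # latent code from the encoder (assumed dim 8)
logits = rng.standard_normal(8)   # learnable mask parameters
m, probs = ste_mask_forward(logits)
z_equivariant = m * z             # transformation-sensitive subspace
z_invariant = (1.0 - m) * z       # transformation-invariant subspace
```

In the actual framework the mask is optimized jointly with the encoder-decoder objective; the backward function here only shows the gradient that straight-through estimation would propagate to the mask logits in place of the zero gradient of the hard threshold.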