Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening

📅 2025-09-10

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

Existing video representations lack sensitivity to temporal directionality, hindering discrimination of chiral actions—e.g., “opening” vs. “closing” a door—that are identical in appearance but reversed in time order. To address this, we introduce *chiral action recognition* as a new benchmark task and propose a *straightening-aware latent-space inductive bias*, explicitly endowing frozen image-feature sequences with temporal directionality. Methodologically, we design a self-supervised autoencoder-based adaptation framework that jointly incorporates geometric priors in the latent space and contrastive learning to achieve temporal disentanglement and dynamic enhancement of frame features. Our approach significantly outperforms large-scale pretrained video models on Something-Something, EPIC-Kitchens, and Charades. Crucially, its representations enable linear separability of chiral actions and yield plug-and-play improvements on downstream benchmarks. Our core contribution is the first formulation of temporal chirality as a learnable geometric constraint—establishing a novel paradigm for fine-grained temporal understanding.

Technology Category

Application Category

📝 Abstract

Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as "opening vs. closing a door", "approaching vs. moving away from something", "folding vs. unfolding paper", etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count . . . ), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time aware video representations which offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder with a latent space with inductive bias inspired by perceptual straightening. We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charade. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.

Problem

Research questions and friction points this paper is trying to address.

Develop time-sensitive video representations for chiral action recognition

Distinguish temporally opposite actions like opening vs closing

Improve linear separability in video embeddings for chiral pairs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised adaptation for time-sensitive video representation

Auto-encoder with perceptual straightening inductive bias

Linear separability of chiral action pairs in latent space

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding