🤖 AI Summary
This work addresses unsupervised cross-view dense video captioning, aiming to generate fine-grained temporal segmentation and activity descriptions for unlabeled target-view videos (e.g., egocentric) using only annotated source-view data (e.g., exocentric). To tackle the substantial inter-view discrepancies arising from temporal misalignment and irrelevant objects, we propose the first gaze-consensus-guided cross-view adaptation framework. It integrates gaze modeling, a Score-based Adversarial Learning Module (SALM), and a Gaze Consensus Construction Module (GCCM), jointly optimized via hierarchical gaze consistency losses that enforce spatiotemporal alignment and representation invariance. We formally define the unsupervised Ego-Exo dense captioning task and introduce the first dedicated benchmark, EgoMe-UEA-DVC. Experiments demonstrate that our method significantly outperforms existing approaches on this benchmark, achieving precise temporal localization and semantically rich descriptions. The code will be made publicly available.
📝 Abstract
Even from an early age, humans naturally switch between exocentric (Exo) and egocentric (Ego) perspectives to understand daily procedural activities. Inspired by this cognitive ability, in this paper we propose a novel Unsupervised Ego-Exo Adaptation for Dense Video Captioning (UEA-DVC) task, which aims to predict the time segments and descriptions for target-view videos while only the source-view data are labeled during training. Although previous works address fully-supervised single-view or cross-view dense video captioning, they fail on the proposed unsupervised task due to the significant inter-view gap caused by temporal misalignment and irrelevant object interference. Hence, we propose a Gaze Consensus-guided Ego-Exo Adaptation Network (GCEAN) that injects gaze information into the learned representations for fine-grained alignment between the Ego and Exo views. Specifically, the Score-based Adversarial Learning Module (SALM) incorporates a discriminative scoring network to learn unified view-invariant representations, bridging the distinct views at a global level. Then, the Gaze Consensus Construction Module (GCCM) utilizes gaze representations to progressively calibrate the learned global view-invariant representations, extracting video temporal contexts based on gaze-focused regions. Moreover, the gaze consensus is constructed via hierarchical gaze-guided consistency losses that spatially and temporally align the source and target views. To support our research, we introduce a new EgoMe-UEA-DVC benchmark, and experiments demonstrate the effectiveness of our method, which outperforms many related methods by a large margin. The code will be released.
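To make the idea of the hierarchical gaze-guided consistency losses concrete, the sketch below illustrates one plausible instantiation: a spatial term that compares gaze-weighted pooled frame features across views, and a temporal term that aligns the normalized gaze-saliency profiles over time. This is a minimal NumPy illustration under our own assumptions; the function names, the specific pooling, and the loss forms are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def gaze_consistency_loss(src_feats, tgt_feats, src_gaze, tgt_gaze, eps=1e-8):
    """Hypothetical hierarchical gaze consistency (illustration only).

    src_feats, tgt_feats: (T, D) per-frame features for source/target views.
    src_gaze, tgt_gaze:   (T,) non-negative gaze-saliency weights per frame.
    """
    # Spatial term: pool frame features weighted by gaze saliency, then
    # penalize cross-view disagreement of the pooled representations.
    src_pool = (src_feats * src_gaze[:, None]).sum(0) / (src_gaze.sum() + eps)
    tgt_pool = (tgt_feats * tgt_gaze[:, None]).sum(0) / (tgt_gaze.sum() + eps)
    spatial = 1.0 - cosine(src_pool, tgt_pool)

    # Temporal term: align the normalized gaze profiles over time,
    # encouraging the two views to attend to the same moments.
    s = src_gaze / (src_gaze.sum() + eps)
    t = tgt_gaze / (tgt_gaze.sum() + eps)
    temporal = float(np.mean((s - t) ** 2))

    return spatial + temporal
```

In an actual training loop, such a loss would be added to the captioning objective so that gradients pull the two views' representations and attention profiles toward a shared gaze consensus; identical features and gaze profiles yield a loss near zero.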