Unsupervised Ego- and Exo-centric Dense Procedural Activity Captioning via Gaze Consensus Adaptation

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses unsupervised cross-view dense video captioning: generating fine-grained temporal segmentation and activity descriptions for unlabeled target-view videos (e.g., egocentric) using only annotated source-view data (e.g., exocentric). To tackle the substantial inter-view discrepancies arising from temporal misalignment and irrelevant objects, we propose the first gaze-consensus-guided cross-view adaptation framework. It integrates gaze modeling, a Score-based Adversarial Learning Module (SALM), and a Gaze Consensus Construction Module (GCCM), jointly optimized via a hierarchical gaze consistency loss that enforces spatiotemporal alignment and representation invariance. We formally define the unsupervised Ego-Exo dense captioning task and introduce the first dedicated benchmark, EgoMe-UEA-DVC. Experiments demonstrate that our method significantly outperforms existing approaches on this benchmark, achieving high-precision temporal localization and semantically rich descriptions. The code will be made publicly available.

📝 Abstract
Even from an early age, humans naturally adapt between exocentric (Exo) and egocentric (Ego) perspectives to understand daily procedural activities. Inspired by this cognitive ability, in this paper we propose a novel Unsupervised Ego-Exo Adaptation for Dense Video Captioning (UEA-DVC) task, which aims to predict the time segments and descriptions for target-view videos while only the source-view data are labeled during training. Although previous works address fully-supervised single-view or cross-view dense video captioning, they fall short on the proposed unsupervised task due to the significant inter-view gap caused by temporal misalignment and irrelevant object interference. Hence, we propose a Gaze Consensus-guided Ego-Exo Adaptation Network (GCEAN) that injects gaze information into the learned representations for fine-grained alignment between the Ego and Exo views. Specifically, the Score-based Adversarial Learning Module (SALM) incorporates a discriminative scoring network to learn unified view-invariant representations, bridging the distinct views at a global level. The Gaze Consensus Construction Module (GCCM) then uses gaze representations to progressively calibrate the learned global view-invariant representations, extracting temporal video contexts based on focused regions. Moreover, the gaze consensus is constructed via hierarchical gaze-guided consistency losses that spatially and temporally align the source and target views. To support our research, we introduce a new EgoMe-UEA-DVC benchmark, and experiments demonstrate the effectiveness of our method, which outperforms many related methods by a large margin. The code will be released.
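The abstract describes hierarchical gaze-guided consistency losses that align the source and target views spatially and temporally while enforcing representation invariance. The paper's exact formulation is not given on this page, so the sketch below is only a plausible illustration of such a three-level loss; the function names, weights, and the specific choices of MSE, L2, and cosine terms are assumptions, not the authors' implementation.

```python
import numpy as np

def spatial_gaze_consistency(heat_src, heat_tgt):
    """Spatial term: MSE between source- and target-view gaze heatmaps
    (hypothetical form of the spatial alignment loss)."""
    return float(np.mean((heat_src - heat_tgt) ** 2))

def temporal_gaze_consistency(traj_src, traj_tgt):
    """Temporal term: mean L2 distance between per-frame gaze points
    across views (assumed trajectories of shape [T, 2])."""
    return float(np.mean(np.linalg.norm(traj_src - traj_tgt, axis=-1)))

def representation_invariance(feat_src, feat_tgt):
    """Representation term: 1 - cosine similarity between gaze-calibrated
    clip features from the two views."""
    num = np.sum(feat_src * feat_tgt, axis=-1)
    den = np.linalg.norm(feat_src, axis=-1) * np.linalg.norm(feat_tgt, axis=-1)
    return float(np.mean(1.0 - num / (den + 1e-8)))

def hierarchical_gaze_loss(heat_src, heat_tgt, traj_src, traj_tgt,
                           feat_src, feat_tgt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three levels; the equal default weights are an
    illustrative assumption."""
    return (weights[0] * spatial_gaze_consistency(heat_src, heat_tgt)
            + weights[1] * temporal_gaze_consistency(traj_src, traj_tgt)
            + weights[2] * representation_invariance(feat_src, feat_tgt))
```

When the two views agree perfectly (identical heatmaps, trajectories, and features), all three terms vanish, which is the behavior any such consensus loss should exhibit.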
Problem

Research questions and friction points this paper is trying to address.

Unsupervised adaptation between ego-exo views for dense video captioning
Addressing inter-view gap due to temporal misalignment and object interference
Using gaze consensus to align and calibrate view-invariant representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised Ego-Exo adaptation for captioning
Gaze Consensus-guided view alignment network
Score-based adversarial learning for view invariance
Zhaofeng Shi
University of Electronic Science and Technology of China
Heqian Qiu
University of Electronic Science and Technology of China
Object Detection, Multimodal
Lanxiao Wang
University of Electronic Science and Technology of China
Qingbo Wu
University of Electronic Science and Technology of China
Video Coding, Image and Video Quality Assessment
Fanman Meng
University of Electronic Science and Technology of China
Hongliang Li
University of Electronic Science and Technology of China