EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-language pretraining methods rely mainly on 1D textual or 2D visual cues and lack explicit modeling of 3D spatial structure, which limits semantic understanding in egocentric scenarios. To address this, the paper proposes a 3D-aware pretraining paradigm for egocentric videos: it enriches video representations with pseudo-depth maps and hand-object interaction cues, and introduces a lightweight 3D-aware decoder that incorporates geometric priors. The model is optimized end to end via depth-enhanced video-text contrastive learning. Experiments demonstrate substantial improvements over state-of-the-art methods on downstream tasks including action recognition, spatiotemporal localization, and video question answering, supporting the role of 3D spatial understanding in egocentric video comprehension.

📝 Abstract
Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Our code will be released at https://github.com/xuboshen/EgoDTM.
Problem

Research questions and friction points this paper is trying to address.

Most prior work learns from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding.
Egocentric video-language models therefore miss the spatial awareness humans develop by interacting with a 3D world.
Original captions are brief and provide little grounding for hand-object interactions or depth cues.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly trains EgoDTM through large-scale 3D-aware video pretraining and video-text contrastive learning.
Uses a lightweight 3D-aware decoder to learn 3D awareness efficiently from pseudo depth maps generated by depth estimation models.
Enriches brief captions with hand-object visual cues by combining several foundation models.
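The summary describes the training objective only at a high level: video-text contrastive learning combined with depth supervision from pseudo-depth maps. A minimal sketch of such a joint objective is given below; the function names, the L1 depth loss, and the weighting `lam` are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, text) embedding pairs."""
    # L2-normalize so similarities are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B); matched pairs on the diagonal

    def xent(l):
        # numerically stable row-wise softmax cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

def depth_l1(pred_depth, pseudo_depth):
    # simple L1 regression against pseudo-depth from an off-the-shelf estimator
    # (an assumed stand-in for whatever depth loss the paper actually uses)
    return np.abs(pred_depth - pseudo_depth).mean()

def joint_loss(video_emb, text_emb, pred_depth, pseudo_depth, lam=0.5):
    # lam weights the 3D-aware term; its value here is a placeholder
    return info_nce(video_emb, text_emb) + lam * depth_l1(pred_depth, pseudo_depth)
```

With perfectly aligned orthogonal embeddings the contrastive term is near zero, so the joint loss is dominated by the depth term; in practice both terms would be computed on model outputs each batch.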