Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

📅 2025-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised egocentric video representation learning relies primarily on aligning video with high-level narrations, neglecting fine-grained modeling of hand-object dynamics. To address this, the paper proposes a hand-object dynamics-aware representation learning framework built on three components: (i) HOD, a data generation pipeline that combines a hand-object detector with large language model–guided fine-grained annotation; (ii) EgoVideo, a model with a lightweight motion adapter designed to encode fine-grained egocentric hand-object motion; and (iii) a co-training strategy that lets the model exploit the HOD data effectively and efficiently. In zero-shot settings the method improves EK-100 multi-instance retrieval by 6.3%, EK-100 classification by 5.7%, and EGTEA classification by 16.3%, and it generalizes robustly to hand-object interaction and robot manipulation tasks.
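The summary does not spell out the training objective, but video-narration alignment of this kind is typically driven by a CLIP-style symmetric InfoNCE loss over clip-narration pairs. The sketch below is a generic version of that objective, not the authors' exact loss; the function name, embedding dimensions, and temperature are illustrative assumptions.

```python
# Generic CLIP-style contrastive loss for video-narration alignment.
# A co-training setup could apply it to both the original coarse
# narrations and the HOD-generated fine-grained ones. Illustrative
# sketch only, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def clip_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(v.size(0))        # matching pairs sit on the diagonal
    # Symmetric cross-entropy: video-to-text plus text-to-video.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

With perfectly aligned embeddings the diagonal dominates and the loss approaches zero; with random pairs it sits near log(B).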

📝 Abstract
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline employing a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks. Code and data are available at https://github.com/OpenRobotLab/EgoHOD/.
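The abstract describes a "lightweight motion adapter" for capturing fine-grained hand-object motion. A common design for such adapters is a small residual bottleneck with a temporal convolution, inserted into a frozen video transformer so only the adapter's few parameters learn motion cues. The sketch below follows that pattern; the class name, dimensions, and zero-initialization are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a lightweight temporal motion adapter:
# down-project, mix information across frames with a depthwise temporal
# convolution, up-project, and add back residually. Zero-initializing the
# up-projection makes the module start as an identity mapping.
import torch
import torch.nn as nn

class MotionAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64, num_frames=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)            # compress features
        self.temporal = nn.Conv1d(bottleneck, bottleneck,
                                  kernel_size=3, padding=1,
                                  groups=bottleneck)      # depthwise conv over time
        self.up = nn.Linear(bottleneck, dim)              # restore dimension
        self.num_frames = num_frames
        nn.init.zeros_(self.up.weight)                    # identity at init
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # x: (batch, frames * tokens_per_frame, dim) from a video transformer.
        b, n, d = x.shape
        t = self.num_frames
        h = self.down(x)                                  # (b, n, bottleneck)
        h = h.view(b, t, n // t, -1)                      # split out the frame axis
        h = h.permute(0, 2, 3, 1).reshape(b * (n // t), -1, t)
        h = self.temporal(h)                              # mix across frames
        h = h.reshape(b, n // t, -1, t).permute(0, 3, 1, 2).reshape(b, n, -1)
        return x + self.up(h)                             # residual connection

tokens = torch.randn(2, 16 * 49, 768)   # 2 clips, 16 frames of 7x7 patch tokens
out = MotionAdapter()(tokens)
print(out.shape)  # torch.Size([2, 784, 768])
```

Because only the adapter is trained, the pretrained backbone's semantics are preserved while motion sensitivity is added cheaply.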
Problem

Research questions and friction points this paper is trying to address.

Model fine-grained hand-object dynamics in egocentric videos.
Generate high-quality narrations with detailed hand-object interactions.
Improve egocentric video representation learning for downstream tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

HOD pipeline integrates a hand-object detector with a large language model.
EgoVideo model uses a lightweight motion adapter to capture fine-grained dynamics.
Co-training strategy lets the model exploit the fine-grained hand-object dynamics in HOD data.
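The first innovation, detector-plus-LLM narration generation, can be sketched as a toy pipeline: detect hand and object boxes per frame, summarize the trajectory as coarse motion words, and hand that summary plus the original narration to an LLM for rewriting. The detector and LLM are stubbed below, and all function names and the prompt wording are hypothetical, not the actual HOD pipeline.

```python
# Toy sketch of an HOD-style annotation pipeline. The detector is a stub
# that drifts the hand box right and down over time; a real pipeline would
# run a hand-object detector per frame and call an LLM with the prompt.

def detect_hand_object(frame_idx):
    # Stub detector: returns (x1, y1, x2, y2) boxes for hand and object.
    shift = 5 * frame_idx
    return {"hand": (40 + shift, 60 + shift, 80 + shift, 100 + shift),
            "object": (70, 55, 120, 95)}

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def describe_motion(frame_indices):
    """Summarize the hand trajectory across frames as coarse motion words."""
    centers = [box_center(detect_hand_object(i)["hand"]) for i in frame_indices]
    dx = centers[-1][0] - centers[0][0]
    dy = centers[-1][1] - centers[0][1]
    horiz = "rightward" if dx > 0 else "leftward"
    vert = "downward" if dy > 0 else "upward"
    return f"hand moves {horiz} and {vert}"

def build_llm_prompt(narration, motion_summary):
    # In the real pipeline an LLM rewrites the coarse narration into a
    # fine-grained description of the hand-object dynamics.
    return (f"Original narration: '{narration}'. "
            f"Detected dynamics: {motion_summary}. "
            f"Rewrite the narration with detailed hand-object dynamics.")

prompt = build_llm_prompt("cut the tomato", describe_motion(list(range(3))))
print(prompt)
```

The key idea is that cheap geometric signals (box trajectories) ground the LLM's rewriting, so the generated narrations describe actual motion rather than hallucinated detail.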
👥 Authors
Baoqi Pei, Zhejiang University (Computer Vision, Multimodal Learning)
Yifei Huang, Shanghai Artificial Intelligence Laboratory, The University of Tokyo
Jilan Xu, Fudan University (Computer Vision, Multimodal, Medical Image Analysis)
Guo Chen, Nanjing University
Yuping He, Nanjing University
Lijin Yang, The University of Tokyo
Yali Wang, Shanghai Artificial Intelligence Laboratory, SIAT
Weidi Xie, Shanghai Jiao Tong University | VGG, University of Oxford (Computer Vision, AI for Healthcare, AI for Science)
Yu Qiao, Shanghai Artificial Intelligence Laboratory
Fei Wu, Zhejiang University
Limin Wang, Shanghai Artificial Intelligence Laboratory, Nanjing University