🤖 AI Summary
This work addresses the challenge of instruction-guided editing for egocentric videos in interactive augmented reality, where severe ego-motion and frequent hand-object interactions create a large domain gap for existing editors, and offline pipelines incur high editing latency. To tackle this, we introduce EgoEditData, the first dedicated dataset for this task; design EgoEdit, a lightweight end-to-end streaming editing model enabling real-time inference on a single GPU; and establish EgoEditBench, a comprehensive benchmark evaluating instruction fidelity, hand preservation, and motion stability. Our method integrates multimodal instruction alignment, temporal stability optimization, and hand-region protection. Experiments demonstrate that our approach significantly outperforms existing methods on egocentric video editing while achieving low-latency real-time interaction, and matches state-of-the-art performance on general-purpose video editing tasks, thereby advancing the practical deployment of egocentric video editing.
📝 Abstract
We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges, including rapid egomotion and frequent hand-object interactions, that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a manually curated dataset built specifically for egocentric editing scenarios, featuring rich hand-object interactions with edits that explicitly preserve hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks, where existing methods struggle, while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit