🤖 AI Summary
This work addresses the challenge of instruction-guided editing for egocentric videos in interactive augmented reality, where severe ego-motion and frequent hand-object interactions create a large domain gap for existing editors, and offline pipelines incur high editing latency. To tackle this, we introduce EgoEditData, the first dedicated dataset for this task; design EgoEdit, a lightweight end-to-end streaming editing model enabling real-time inference on a single GPU; and establish EgoEditBench, a comprehensive benchmark evaluating instruction fidelity, hand preservation, and motion stability. Our method integrates multimodal instruction alignment, temporal stability optimization, and hand-region protection. Experiments demonstrate that our approach significantly outperforms existing methods on egocentric video editing while achieving low-latency real-time interaction, and matches state-of-the-art performance on general-purpose video editing tasks, thereby advancing the practical deployment of egocentric video editing.
📝 Abstract
We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges, including rapid egomotion and frequent hand-object interactions, that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a manually curated dataset built specifically for egocentric editing scenarios, featuring rich hand-object interactions with edits that explicitly preserve hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks, where existing methods struggle, while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit