🤖 AI Summary
This work addresses imitation error detection between asynchronous, length-mismatched first-person (ego) and third-person (exo) videos. To tackle cross-view domain shift, temporal misalignment, and redundancy, the authors propose SAVA-X, an Align-Fuse-Detect framework that integrates view-conditioned adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion to enable fine-grained error localization and discrimination. Under a unified evaluation protocol, strong baselines adapted from dense video captioning and temporal action detection struggle in this cross-view regime, while SAVA-X achieves significant improvements in AUPRC and mean tIoU on the EgoMe benchmark. Ablation studies further confirm the effectiveness and complementarity of each component of the proposed architecture.
📝 Abstract
Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
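To make the fusion step concrete, here is a minimal NumPy sketch of the bidirectional cross-attention idea described above: each view's frame features attend to the other view's features, so information flows in both directions even when the two timelines differ in length. This is an illustrative toy, not the authors' implementation; all function names, dimensions, and the residual-fusion choice are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: one view queries the other.
    scores = queries @ keys_values.T / np.sqrt(d)   # (T_q, T_kv)
    return softmax(scores, axis=-1) @ keys_values   # (T_q, d)

def bidirectional_fusion(ego, exo):
    # ego: (T_ego, d) ego-view frame features
    # exo: (T_exo, d) exo-view frame features (lengths may differ)
    d = ego.shape[-1]
    ego_from_exo = cross_attention(ego, exo, d)  # ego attends to exo
    exo_from_ego = cross_attention(exo, ego, d)  # exo attends to ego
    # Residual fusion preserves each view's own timeline length.
    return ego + ego_from_exo, exo + exo_from_ego

rng = np.random.default_rng(0)
ego = rng.normal(size=(8, 16))    # 8 ego frames
exo = rng.normal(size=(12, 16))   # 12 exo frames (length-mismatched)
fused_ego, fused_exo = bidirectional_fusion(ego, exo)
print(fused_ego.shape, fused_exo.shape)  # (8, 16) (12, 16)
```

Note that attending across views rather than resampling one timeline onto the other lets downstream detection stay on the ego timeline, which matches the task formulation above.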