🤖 AI Summary
This work addresses imitation error detection between asynchronous, length-mismatched first-person (ego) and third-person (exo) videos. To tackle cross-view domain shift, temporal misalignment, and redundancy, the authors propose SAVA-X, an Align-Fuse-Detect framework that integrates view-conditioned adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion to enable fine-grained error localization and discrimination. Under a unified evaluation protocol, strong baselines adapted from dense video captioning and temporal action detection struggle in this cross-view regime, while SAVA-X achieves significant improvements in AUPRC and mean tIoU on the EgoMe benchmark. Ablation studies further confirm the effectiveness and complementarity of each component of the proposed architecture.
📝 Abstract
Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
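To make the fusion step concrete, here is a minimal NumPy sketch of the bidirectional cross-attention idea described above: each view's frame features attend to the other view's features, so information flows in both directions even when the two timelines differ in length. This is an illustrative toy, not the authors' implementation; all function names, dimensions, and the residual-fusion choice are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product attention: one view queries the other.
    scores = queries @ keys_values.T / np.sqrt(d)   # (T_q, T_kv)
    return softmax(scores, axis=-1) @ keys_values   # (T_q, d)

def bidirectional_fusion(ego, exo):
    # ego: (T_ego, d) ego-view frame features
    # exo: (T_exo, d) exo-view frame features (lengths may differ)
    d = ego.shape[-1]
    ego_from_exo = cross_attention(ego, exo, d)  # ego attends to exo
    exo_from_ego = cross_attention(exo, ego, d)  # exo attends to ego
    # Residual fusion preserves each view's own timeline length.
    return ego + ego_from_exo, exo + exo_from_ego

rng = np.random.default_rng(0)
ego = rng.normal(size=(8, 16))    # 8 ego frames
exo = rng.normal(size=(12, 16))   # 12 exo frames (length-mismatched)
fused_ego, fused_exo = bidirectional_fusion(ego, exo)
print(fused_ego.shape, fused_exo.shape)  # (8, 16) (12, 16)
```

Note that attending across views rather than resampling one timeline onto the other lets downstream detection stay on the ego timeline, which matches the task formulation above.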