SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

📅 2026-03-13
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses imitation error detection between asynchronous, length-mismatched first-person (ego) and third-person (exo) videos. To tackle cross-view domain shift, temporal misalignment, and redundancy, the authors propose SAVA-X, a framework that integrates view-conditioned adaptive sampling, scene-adaptive view embeddings, and a bidirectional cross-attention fusion mechanism to enable fine-grained error localization and discrimination. Evaluated under a unified protocol against strong baselines adapted from dense video captioning and temporal action detection, SAVA-X achieves significant improvements in AUPRC and mean tIoU on the EgoMe benchmark. Ablation studies further confirm the effectiveness and complementarity of each component.
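The paper does not publish implementation details in this summary, but the adaptive-sampling component it names has to reconcile ego and exo streams of different lengths before fusion. A minimal sketch of one plausible reading, uniform resampling of each view's frame features to a fixed token count (the function name and shapes are illustrative assumptions, not the authors' API):

```python
import numpy as np

def adaptive_sample(frames: np.ndarray, num_tokens: int) -> np.ndarray:
    """Resample a variable-length frame sequence to a fixed number of
    tokens so ego and exo streams become comparable in length.
    `frames` has shape (T, D): T frames, each a D-dim feature.
    Hypothetical helper, not from the SAVA-X codebase."""
    T = frames.shape[0]
    # Evenly spaced (possibly repeated) frame indices over the clip.
    idx = np.linspace(0, T - 1, num_tokens).round().astype(int)
    return frames[idx]

# Length-mismatched streams: 120 ego frames vs. 300 exo frames.
rng = np.random.default_rng(0)
ego = rng.standard_normal((120, 256))
exo = rng.standard_normal((300, 256))
ego_tok = adaptive_sample(ego, 64)  # shape (64, 256)
exo_tok = adaptive_sample(exo, 64)  # shape (64, 256)
```

A view-conditioned variant could select `num_tokens` or the index spacing per view; the uniform version above is only the simplest baseline for the length-mismatch problem.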

📝 Abstract
Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
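The abstract's third component, bidirectional cross-attention fusion, is a standard construction: each view's tokens query the other view's tokens, and both fused streams are kept. A minimal single-head NumPy sketch under that assumption (function names and the residual connection are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats: np.ndarray, kv_feats: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: queries from one view attend over
    keys/values from the other. Shapes: (Tq, D) and (Tk, D) -> (Tq, D)."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d), axis=-1)
    return attn @ kv_feats

def bidirectional_fuse(ego: np.ndarray, exo: np.ndarray):
    """Fuse the two views with residual cross-attention in both
    directions, preserving each view's own timeline."""
    ego_fused = ego + cross_attend(ego, exo)  # ego queries exo
    exo_fused = exo + cross_attend(exo, ego)  # exo queries ego
    return ego_fused, exo_fused

rng = np.random.default_rng(0)
ego = rng.standard_normal((64, 256))
exo = rng.standard_normal((64, 256))
ego_f, exo_f = bidirectional_fuse(ego, exo)  # both keep shape (64, 256)
```

Keeping both fused streams (rather than collapsing to one) matches the task's need to localize steps on the ego timeline while still grounding them in the exo demonstration.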
Problem

Research questions and friction points this paper is trying to address.

Ego-to-Exo Imitation Error Detection
Cross-View Alignment
Temporal Misalignment
Domain Shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-view alignment
scene-adaptive embedding
bidirectional cross-attention
imitation error detection
ego-exo video understanding
Authors
Xiang Li (University of Electronic Science and Technology of China)
Heqian Qiu (University of Electronic Science and Technology of China, UESTC) · Object Detection, Multimodal
Lanxiao Wang (University of Electronic Science and Technology of China)
Benliu Qiu (University of Electronic Science and Technology of China)
Fanman Meng (University of Electronic Science and Technology of China)
Linfeng Xu (University of Electronic Science and Technology of China)
Hongliang Li (Professor, UESTC) · Computer Vision, Multimedia Processing