VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

📅 2026-05-21

🤖 AI Summary

This work addresses event detection in video capsule endoscopy (VCE), where clinically relevant findings are sparse, visually heterogeneous, and scored at the event level rather than by frame accuracy. The authors propose VISTA, a framework that integrates two backbones, EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, with a diverse head ensemble, validation-guided weighted fusion, and anatomy-aware temporal event decoding. After the competition, extending local threshold refinement with a global coarse search raised temporal mAP@0.5 from 0.3530 (original official submission) to 0.3726 and mAP@0.95 from 0.3235 to 0.3431, placing the team second in the RAREVISION post-competition evaluation.
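To make "event-level evaluation" concrete, the following is a minimal sketch of how per-frame scores might be turned into events and compared with a ground-truth event by temporal IoU. It is not the paper's anatomy-aware decoder or the official challenge metric: the function names `decode_events` and `temporal_iou`, the closed-interval convention, and the simple thresholding rule are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Event = Tuple[int, int]  # (start_frame, end_frame), both inclusive (assumed convention)


def decode_events(scores: Sequence[float], threshold: float) -> List[Event]:
    """Group consecutive frames with score >= threshold into events.

    Hypothetical baseline decoder; it ignores anatomy and any smoothing.
    """
    events: List[Event] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # an event opens at this frame
        elif s < threshold and start is not None:
            events.append((start, i - 1))  # the event closed on the previous frame
            start = None
    if start is not None:
        events.append((start, len(scores) - 1))  # event runs to the last frame
    return events


def temporal_iou(pred: Event, gt: Event) -> float:
    """Intersection over union of two inclusive frame intervals."""
    inter = max(min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1, 0)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union


if __name__ == "__main__":
    frame_scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.7]
    print(decode_events(frame_scores, threshold=0.5))
    print(temporal_iou((100, 199), (110, 209)))
```

Under this sketch, a predicted event spanning frames 100 to 199 against a ground-truth event spanning 110 to 209 overlaps on 90 of 110 frames, so it would count as a match at an IoU threshold of 0.5 but not at 0.95. That gap is why small boundary errors matter much more for the stricter metric.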

📝 Abstract

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
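The abstract does not spell out the search procedure, so the following is only a sketch of one common way to pair a global coarse search with local threshold refinement. It assumes a single scalar threshold and a hypothetical `score_fn` that returns a validation metric for a given threshold; the grid sizes and the function name `coarse_to_fine_threshold` are likewise illustrative, and the actual method may search per class or in a different order.

```python
from typing import Callable, Tuple


def coarse_to_fine_threshold(
    score_fn: Callable[[float], float],
    lo: float = 0.0,
    hi: float = 1.0,
    coarse_steps: int = 11,
    fine_steps: int = 21,
) -> Tuple[float, float]:
    """Return (best_threshold, best_score) found on [lo, hi].

    score_fn is a hypothetical callable, e.g. validation temporal mAP
    computed after decoding events with the given threshold.
    """
    # Stage 1: global coarse search on an evenly spaced grid over the full range.
    step = (hi - lo) / (coarse_steps - 1)
    coarse_grid = [lo + i * step for i in range(coarse_steps)]
    best_t = max(coarse_grid, key=score_fn)

    # Stage 2: local refinement on a finer grid within one coarse step
    # on either side of the coarse optimum, clipped to [lo, hi].
    f_lo, f_hi = max(lo, best_t - step), min(hi, best_t + step)
    f_step = (f_hi - f_lo) / (fine_steps - 1)
    fine_grid = [f_lo + i * f_step for i in range(fine_steps)]
    best_t = max(fine_grid, key=score_fn)
    return best_t, score_fn(best_t)


if __name__ == "__main__":
    # Toy objective with its peak at 0.37, standing in for a validation metric.
    best_t, best_s = coarse_to_fine_threshold(lambda t: -(t - 0.37) ** 2)
    print(round(best_t, 4), best_s)
```

The design rationale is that a purely local refinement can settle on a nearby but suboptimal threshold when the metric is not unimodal, whereas the coarse pass first locates the most promising region of the whole range.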

Problem

Research questions and friction points this paper is trying to address.

capsule endoscopy

rare-pathology detection

event-level evaluation

visual heterogeneity

temporal event detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation models

validation-guided fusion

anatomy-aware decoding