🤖 AI Summary
This work addresses the detection of rare pathological events in capsule endoscopy videos, a task made difficult by lesion sparsity, high visual heterogeneity, and event-level evaluation. The authors formulate the task as a metric-aligned event detection problem and propose a hierarchical framework that combines a local temporal model (EndoFM-LV) with a frame-level visual model (DINOv3 ViT-L/16). A validation-guided fusion stage with class-wise model weighting and an anatomy-constrained temporal decoding strategy let the two models complement each other and target event-level performance. On the official hidden test set, the method achieves a temporal mAP@0.5 of 0.3530 and a temporal mAP@0.95 of 0.3235.
📝 Abstract
Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.
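The fusion and decoding stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fuse_classwise`, `smooth`, `decode_events`), the moving-average smoother, the array shapes, and the `min_len` parameter are all assumptions made for illustration. Probability calibration and the anatomical constraints are omitted because the abstract does not specify them.

```python
import numpy as np

def fuse_classwise(probs, weights):
    """Fuse per-model frame probabilities with class-wise model weights.

    probs:   (M, T, C) array, M models, T frames, C labels (assumed layout).
    weights: (M, C) non-negative weights, e.g. derived from validation scores.
    """
    w = weights / weights.sum(axis=0, keepdims=True)  # normalize per class
    return np.einsum("mtc,mc->tc", probs, w)          # weighted average over models

def smooth(probs, window=3):
    """Moving average along time for each label (one possible smoother).

    Assumes window <= number of frames so the output keeps shape (T, C).
    """
    kernel = np.ones(window) / window
    cols = [np.convolve(probs[:, c], kernel, mode="same")
            for c in range(probs.shape[1])]
    return np.stack(cols, axis=1)

def decode_events(probs, thresholds, min_len=1):
    """Turn frame probabilities into per-label events.

    Returns a list of (label, start, end) with inclusive frame indices.
    """
    events = []
    for c in range(probs.shape[1]):
        active = probs[:, c] >= thresholds[c]
        # Pad with False so runs touching either end still have both edges.
        padded = np.concatenate(([False], active, [False]))
        diff = np.diff(padded.astype(int))
        starts = np.flatnonzero(diff == 1)
        ends = np.flatnonzero(diff == -1) - 1
        for s, e in zip(starts, ends):
            if e - s + 1 >= min_len:
                events.append((c, int(s), int(e)))
    return events

# Toy input: 2 models, 6 frames, 2 labels (values are made up).
model_a = np.array([[0.1, 0.2], [0.9, 0.2], [0.9, 0.2],
                    [0.9, 0.2], [0.1, 0.2], [0.1, 0.2]])
model_b = np.array([[0.1, 0.2], [0.7, 0.2], [0.7, 0.2],
                    [0.7, 0.2], [0.1, 0.8], [0.1, 0.8]])
probs = np.stack([model_a, model_b])          # shape (2, 6, 2)
weights = np.array([[3.0, 1.0], [1.0, 3.0]])  # hypothetical class-wise weights

fused = fuse_classwise(probs, weights)
events = decode_events(fused, thresholds=[0.5, 0.5])
print(events)
```

In this toy input, model A is trusted more for label 0 and model B more for label 1, so each label's fused score follows the model with the larger weight for that label. In a fuller pipeline, `smooth` would be applied to `fused` before thresholding, and per-label thresholds would be tuned on validation data.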