ACM Multimedia Grand Challenge on ENT Endoscopy Analysis

📅 2025-08-06
🤖 AI Summary
Automated analysis of otolaryngology (ENT) endoscopic images has long been hindered by inter-device and inter-operator variability, subtle and localized pathologies, and fine-grained discrimination challenges such as left/right laterality and vocal fold status. Moreover, existing benchmarks lack support for cross-modal similar-case retrieval (enabling joint visual + bilingual textual queries). To address these gaps, we introduce the first large-scale, bilingual (Vietnamese–English), clinically supervised ENT endoscopy dataset. We propose a unified framework integrating anatomical region–level fine-grained classification with cross-modal retrieval (image–image and text–image). We define three standardized benchmark tasks. Rigorously validated via expert annotation, server-side blind evaluation, and an international challenge, our work establishes a reproducible, clinically interpretable, and multimodal evaluation ecosystem, advancing intelligent ENT diagnosis toward clinical trustworthiness and interactive multimodal reasoning.

📝 Abstract
Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. Moreover, we report results from the top-performing teams and provide an insightful discussion.
Problem

Research questions and friction points this paper is trying to address.

Automated analysis of ENT endoscopy imagery is underdeveloped
Lack of public benchmarks for case retrieval and classification
Need for fine-grained anatomical classification and bilingual retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained anatomical classification integration
Bilingual image-text retrieval support
Standardized benchmark tasks protocol
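
The retrieval tasks named above (image-to-image and text-to-image) can be illustrated with a minimal sketch. It assumes queries and gallery images have already been embedded into a shared vector space, ranks the gallery by cosine similarity, and scores the ranking with Recall@k. The metric choice, the function names (`cosine`, `rank_gallery`, `recall_at_k`), and the toy vectors are all assumptions for illustration; the summary does not state which metrics or models the challenge actually uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_gallery(query_vec, gallery):
    """Return gallery image ids sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda image_id: cosine(query_vec, gallery[image_id]),
                  reverse=True)

def recall_at_k(queries, gallery, relevant, k):
    """Fraction of queries whose relevant image id appears in the top-k results."""
    hits = 0
    for query_id, query_vec in queries.items():
        if relevant[query_id] in rank_gallery(query_vec, gallery)[:k]:
            hits += 1
    return hits / len(queries)

# Hypothetical 2-D embeddings; a real system would use a learned encoder.
gallery = {"img_ear": [1.0, 0.0], "img_nose": [0.0, 1.0], "img_throat": [0.7, 0.7]}
queries = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}  # image or text query embeddings
relevant = {"q1": "img_ear", "q2": "img_throat"}  # assumed ground truth

print(recall_at_k(queries, gallery, relevant, k=1))
print(recall_at_k(queries, gallery, relevant, k=2))
```

Because the ranking only sees vectors, the same scoring code would apply to both retrieval settings; only the encoder that produces the query embedding (image or text) would differ.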
Trong-Thuan Nguyen
University of Science, VNU-HCM
Deep Learning, Computer Vision, Video Understanding
Viet-Tham Huynh
Researcher at Software Engineering Laboratory, University of Science, VNU-HCM
Software Engineering, Virtual Reality, Computer Vision
Thao Thi Phuong Dao
Otolaryngologist at Thong Nhat Hospital, Ho Chi Minh City, Vietnam.
Otolaryngology-Head and Neck Surgery, Computer Vision, Computer-aided Diagnosis, Medical Image Analysis
Ha Nguyen Thi
Thong Nhat Hospital, Ho Chi Minh, Vietnam
Tien To Vu Thuy
Faculty of Medicine, Pham Ngoc Thach University of Medicine, Ho Chi Minh, Vietnam
Uyen Hanh Tran
Cho Ray Hospital, Ho Chi Minh, Vietnam
Tam V. Nguyen
University of Dayton, Ohio, United States
Thanh Dinh Le
University of Health Sciences, VNU-HCM, Vietnam National University, Ho Chi Minh, Vietnam; Thong Nhat Hospital, Ho Chi Minh, Vietnam
Minh-Triet Tran
University of Science & John von Neumann Institute, VNU-HCM
Cryptography and Security, Multimedia and Interaction, Computer Vision and Machine Learning, Software Engineering