Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three core clinical tasks in otolaryngology (ENT) endoscopic image analysis: image classification, image retrieval, and text-to-image retrieval. We propose a unified vision-language framework built upon the CLIP ViT-B/16 architecture. Methodologically, we introduce multi-level [CLS] token fusion, spherical feature interpolation, class-specific natural language prompts, and low-rank adaptation to achieve robust cross-modal semantic alignment and strong generalization under few-shot settings; the model is jointly optimized via contrastive learning and supervised classification objectives. Evaluated on the ACM MM’25 ENTRep Challenge, our method achieves 95% in both classification accuracy and F1 score, Recall@1 of 0.93 (image retrieval) and 0.92 (text-to-image retrieval), and mean reciprocal rank (MRR) of 0.97 and 0.96, respectively, substantially outperforming all baselines. These results validate the framework’s clinical applicability and superiority in multimodal representation learning.
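The multi-level [CLS] token fusion named above can be sketched as follows. This is a minimal illustration, assuming the [CLS] tokens of a few intermediate ViT blocks are concatenated and linearly projected; the specific layers tapped and the fusion operator are assumptions, as the summary does not specify them.

```python
import torch
import torch.nn as nn

class MultiLevelCLSFusion(nn.Module):
    """Fuse [CLS] tokens from several transformer blocks into one embedding.

    The tapped layer indices and the linear-projection fusion are
    illustrative assumptions, not details published in the paper.
    """

    def __init__(self, dim: int, layer_ids=(3, 7, 11)):
        super().__init__()
        self.layer_ids = layer_ids
        # Concatenate the selected [CLS] tokens, then project back to `dim`.
        self.proj = nn.Linear(dim * len(layer_ids), dim)

    def forward(self, hidden_states):
        # hidden_states: per-block outputs, each of shape (batch, tokens, dim);
        # token 0 is the [CLS] token in a ViT.
        cls_tokens = [hidden_states[i][:, 0] for i in self.layer_ids]
        return self.proj(torch.cat(cls_tokens, dim=-1))
```

Drawing on several depths gives the classifier access to both low-level texture cues and high-level semantics, which is the stated motivation for representation diversity.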

📝 Abstract
We present a unified vision-language framework tailored for ENT endoscopy image analysis that simultaneously tackles three clinically relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. Unlike conventional CNN-based pipelines that struggle to capture cross-modal semantics, our approach leverages the CLIP ViT-B/16 backbone and enhances it through Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. These components collectively enable efficient fine-tuning on limited medical data while improving representation diversity and semantic alignment across modalities. To bridge the gap between visual inputs and textual diagnostic context, we introduce class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. We validated our framework through participation in the ACM MM'25 ENTRep Grand Challenge, achieving 95% accuracy and F1 score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval, respectively, and MRR scores of 0.97 and 0.96. Ablation studies demonstrated the incremental benefits of each architectural component, validating the effectiveness of our design for robust multimodal medical understanding in low-resource clinical settings.
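The spherical feature interpolation the abstract mentions can be sketched as a standard slerp between L2-normalized embeddings. Where in the pipeline the paper applies it (e.g., as feature-space augmentation) is not stated here, so this is only the operation itself:

```python
import torch
import torch.nn.functional as F

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between feature vectors.

    Interpolating along the great circle of the unit hypersphere keeps
    blended features on the sphere that normalized CLIP embeddings live on,
    unlike plain linear interpolation, which shortens the vector.
    """
    a_n = F.normalize(a, dim=-1)
    b_n = F.normalize(b, dim=-1)
    # Angle between the two unit vectors; clamp for numerical safety.
    cos = (a_n * b_n).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos)
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a_n + (torch.sin(t * omega) / so) * b_n
```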
Problem

Research questions and friction points this paper is trying to address.

Classifying endoscopy images with limited medical data
Improving cross-modal alignment between images and text
Enhancing retrieval accuracy for medical image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-Rank Adaptation for efficient medical fine-tuning
Multi-level CLS token aggregation for representation diversity
Class-specific natural language prompts for cross-modal alignment
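The last bullet, class-specific prompts driving cross-modal alignment, pairs naturally with the joint objective described in the abstract (supervised classification plus contrastive learning). A minimal sketch, assuming hypothetical ENT class names, a generic prompt template, and an equal 0.5 loss weighting, none of which are reported in this summary:

```python
import torch
import torch.nn.functional as F

# Hypothetical class names and template; the actual ENT classes and
# prompt wording used in the paper are not reproduced here.
CLASSES = ["nasal polyp", "vocal cord nodule", "normal mucosa"]
PROMPTS = [f"an endoscopic image showing {c}" for c in CLASSES]

def joint_loss(img_emb, txt_emb, cls_logits, labels,
               temperature=0.07, alpha=0.5):
    """CLIP-style symmetric contrastive loss plus supervised cross-entropy.

    Assumes img_emb[i] and txt_emb[i] form a matched pair within the batch.
    `temperature` and the weighting `alpha` are assumed defaults.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(sim.size(0), device=sim.device)
    contrastive = 0.5 * (F.cross_entropy(sim, targets) +
                         F.cross_entropy(sim.t(), targets))
    return alpha * contrastive + (1 - alpha) * F.cross_entropy(cls_logits, labels)
```

The contrastive term pulls each image toward its class prompt while the cross-entropy term keeps the classification head supervised, matching the joint optimization described in the summary.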