ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image captioning methods predominantly generate generic descriptions, failing to capture event-level semantics—limiting their applicability in high-stakes domains such as news reporting and digital archiving. To address this, we propose an event-aware image captioning framework that integrates external knowledge. First, we perform two-stage news article retrieval using DINOv2 features and patch-level mutual nearest-neighbor re-ranking. Second, we introduce semantic Gaussian normalization to align temporal, social, and historical contextual embeddings with visual representations. Finally, we fuse retrieved article summaries and leverage large language models to generate factually grounded, narrative-rich event descriptions. Evaluated on the private test set of the EVENTA 2025 Challenge, our method achieves a total score of 0.54666, ranking second and demonstrating strong cross-modal event understanding between images and text.

📝 Abstract
Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap's effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. The code is available at https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.
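The two-stage retrieval described in the abstract can be sketched as follows: global DINOv2 feature similarity selects an initial candidate pool, then candidates are re-ranked by counting patch-level mutual nearest-neighbor matches. This is an illustrative reading of the pipeline, not the authors' code; the function names and the use of cosine similarity with a raw mutual-match count are assumptions.

```python
import numpy as np

def l2norm(x, axis=-1):
    """L2-normalize features so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def retrieve(query_global, query_patches, db_globals, db_patches, top_k=10):
    """Two-stage article retrieval sketch (assumed formulation).

    Stage 1: rank article images by cosine similarity of global DINOv2
    features and keep the top-k candidates.
    Stage 2: re-rank candidates by the number of patch-level mutual
    nearest-neighbor matches with the query image.
    """
    # Stage 1: global feature similarity
    q = l2norm(query_global)
    G = l2norm(db_globals)                  # (N, D) global features per article image
    sims = G @ q                            # (N,) cosine similarities
    cand = np.argsort(-sims)[:top_k]        # initial candidate indices

    # Stage 2: patch-level mutual nearest-neighbor re-ranking
    qp = l2norm(query_patches)              # (P, D) query patch features
    scores = []
    for idx in cand:
        cp = l2norm(db_patches[idx])        # (P', D) candidate patch features
        S = qp @ cp.T                       # (P, P') patch-to-patch similarity
        fwd = S.argmax(axis=1)              # best candidate patch for each query patch
        bwd = S.argmax(axis=0)              # best query patch for each candidate patch
        # a match is "mutual" when both directions agree
        scores.append(np.sum(bwd[fwd] == np.arange(len(fwd))))
    order = np.argsort(-np.asarray(scores), kind="stable")
    return cand[order]
```

The stable sort keeps the stage-1 (global similarity) ordering as a tie-breaker when two candidates have the same number of mutual patch matches.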
Problem

Research questions and friction points this paper is trying to address.

Generating event-enriched captions for images using contextual articles
Addressing limitations of standard vision-language models missing contextual information
Bridging visual perception with real-world knowledge for context-aware understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage article retrieval with DINOv2 embeddings
Context extraction from summaries and metadata
LLM caption generation with Semantic Gaussian Normalization
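The page does not define Semantic Gaussian Normalization, so the following is only a hypothetical sketch of what aligning contextual embeddings with visual representations via Gaussian normalization might look like: standardize a contextual embedding to zero mean and unit variance, then rescale it to the visual embedding's statistics so both modalities occupy comparable ranges before fusion. The function name and formulation are assumptions, not the paper's method.

```python
import numpy as np

def semantic_gaussian_normalization(context_emb, visual_emb, eps=1e-6):
    """Hypothetical Gaussian normalization of a contextual embedding.

    Standardizes context_emb (zero mean, unit variance), then shifts and
    scales it to match the mean and standard deviation of visual_emb.
    This is an illustrative guess at the technique, not the paper's
    exact formulation.
    """
    mu_c, sd_c = context_emb.mean(), context_emb.std()
    mu_v, sd_v = visual_emb.mean(), visual_emb.std()
    z = (context_emb - mu_c) / (sd_c + eps)   # standardize the context embedding
    return z * sd_v + mu_v                    # align to the visual statistics
```

After this step, the contextual and visual embeddings share first- and second-moment statistics, which is one plausible way to make them directly comparable before LLM-based fusion.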