EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for event image retrieval from free-text descriptions struggle to comprehend abstract event semantics, implicit causal relationships, and long-range temporal context. To address this, we propose a multi-stage retrieval framework: (1) coarse-grained article-level retrieval using Qwen3; (2) event-aware contextual alignment reranking via Qwen3-Reranker; and (3) fine-grained cross-modal semantic matching and image scoring with Qwen2-VL. Crucially, we introduce a Reciprocal Rank Fusion (RRF) strategy to integrate outputs from multiple model configurations, thereby enhancing representational capacity for complex events. Evaluated on the private test set of the EVENTA 2025 Grand Challenge Track 2, our method achieves first place, demonstrating substantial improvements in retrieving images aligned with abstract events, causal logic, and intricate narrative structures.
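Reciprocal Rank Fusion combines multiple ranked lists by scoring each item as a sum of reciprocal ranks. A minimal sketch of standard RRF is below (the constant `k=60` is the value commonly used in the RRF literature; the paper does not state its exact configuration, so the item names and parameter are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item ids.

    Each item's fused score is sum(1 / (k + rank)) over the lists
    that contain it; higher fused score means a better final rank.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Return items ordered by descending fused score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: three model configurations rank candidate images
fused = reciprocal_rank_fusion([
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_d"],
    ["img_a", "img_d", "img_b"],
])
```

Items ranked highly by several configurations (here `img_a`) rise to the top even when no single list is fully trusted, which is what makes RRF robust to per-model failures.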

📝 Abstract
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
Problem

Research questions and friction points this paper is trying to address.

Retrieving images from free-form captions with event semantics
Overcoming limitations of conventional vision-language retrieval methods
Addressing abstract events and complex narratives in multimodal retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage retrieval framework with event-aware reranking
Leverages Qwen models for article search and image scoring
Fuses outputs using Reciprocal Rank Fusion for robustness
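The three stages above can be sketched as a single pipeline. This is an assumed interface, not the released code: `retrieve`, `rerank`, and `score_image` are hypothetical stand-ins for the Qwen3 dense retriever, Qwen3-Reranker, and Qwen2-VL scorer, and the toy data below exists only to make the control flow concrete.

```python
def retrieve_images(caption, articles, images_by_article,
                    retrieve, rerank, score_image, top_k=10):
    """Hypothetical three-stage pipeline mirroring the paper's design."""
    # Stage 1: coarse article-level retrieval (ranked article ids)
    candidates = retrieve(caption, articles)
    # Stage 2: event-aware contextual reranking of candidate articles
    reranked = rerank(caption, candidates)
    # Stage 3: pool images from top articles, score cross-modally, sort
    pool = [img for art in reranked[:top_k]
            for img in images_by_article[art]]
    return sorted(pool, key=lambda img: score_image(caption, img),
                  reverse=True)

# Toy stand-ins (assumptions, not the real models):
articles = {"a1": "storm hits city", "a2": "team wins final"}
images_by_article = {"a1": ["i1", "i2"], "a2": ["i3"]}
retrieve = lambda cap, arts: sorted(
    arts, key=lambda a: -len(set(cap.split()) & set(arts[a].split())))
rerank = lambda cap, cands: cands          # identity reranker
score_image = lambda cap, img: 1.0         # constant scorer

ranked = retrieve_images("storm hits city", articles,
                         images_by_article, retrieve, rerank, score_image)
```

Keeping each stage behind a plain function boundary is what lets the authors swap model configurations and fuse the resulting rankings with RRF.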