Deep Retrieval at CheckThat! 2025: Identifying Scientific Papers from Implicit Social Media Mentions via Hybrid Retrieval and Re-Ranking

📅 2025-05-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of retrieving scientific literature that is only implicitly referenced in social media posts. We propose a three-stage hybrid retrieval framework: (1) a dual-path candidate retrieval layer combining lexical BM25 matching with a fine-tuned INF-Retriever-v1 dense model; (2) an efficient vector indexing layer built on FAISS; and (3) a re-ranking layer using an open-source LLM-based cross-encoder. The entire pipeline runs locally and uses no external training data, ensuring reproducibility and practicality. To our knowledge, this is the first work to integrate lightweight dense retrieval with large language model–based re-ranking for semantic alignment between informal social media language and formal academic text, effectively bridging their lexical and conceptual gaps. On the official benchmark, our approach achieves 76.46% MRR@5 on the development set (rank #1) and 66.43% on the blind test set (rank #3 among 31 teams), approaching the performance of the top system.
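The lexical path of stage (1) can be illustrated with a minimal BM25 scorer. This is a toy stand-in for sketching purposes only: the actual system pairs a real BM25 index with a fine-tuned INF-Retriever-v1 dense model and a FAISS vector store, none of which are shown here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with a minimal BM25 over
    whitespace-tokenized text (illustrative stand-in for a real index)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency of each term across the collection.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "a dog ran far away"]
print(bm25_scores("cat", docs))  # doc 0 scores higher; doc 1 scores 0
```

In the full pipeline these lexical scores produce one candidate list, while the dense model produces a second, semantically matched list over the same corpus.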

πŸ“ Abstract
We present the methodology and results of the Deep Retrieval team for subtask 4b of the CLEF CheckThat! 2025 competition, which focuses on retrieving relevant scientific literature for given social media posts. To address this task, we propose a hybrid retrieval pipeline that combines lexical precision, semantic generalization, and deep contextual re-ranking, enabling robust retrieval that bridges the informal-to-formal language gap. Specifically, we combine BM25-based keyword matching with a FAISS vector store using a fine-tuned INF-Retriever-v1 model for dense semantic retrieval. BM25 returns the top 30 candidates, and semantic search yields 100 candidates, which are then merged and re-ranked via a large language model (LLM)-based cross-encoder. Our approach achieves a mean reciprocal rank at 5 (MRR@5) of 76.46% on the development set and 66.43% on the hidden test set, securing the 1st position on the development leaderboard and ranking 3rd on the test leaderboard (out of 31 teams), with a relative performance gap of only 2 percentage points compared to the top-ranked system. We achieve this strong performance by running open-source models locally and without external training data, highlighting the effectiveness of a carefully designed and fine-tuned retrieval pipeline.
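The candidate-fusion step described above (the BM25 top 30 merged with 100 dense candidates before LLM re-ranking) can be sketched as an order-preserving union over paper ids; the function name and signature are illustrative, not taken from the paper's code:

```python
def merge_candidates(bm25_ranked, dense_ranked, bm25_k=30, dense_k=100):
    """Merge the top-k candidates from the lexical and dense retrievers,
    de-duplicating while preserving first-seen order. The merged pool is
    then passed to the cross-encoder re-ranker (not shown)."""
    seen, merged = set(), []
    for pid in list(bm25_ranked[:bm25_k]) + list(dense_ranked[:dense_k]):
        if pid not in seen:
            seen.add(pid)
            merged.append(pid)
    return merged

print(merge_candidates(["p1", "p2", "p3"], ["p3", "p4", "p5"]))
# ['p1', 'p2', 'p3', 'p4', 'p5']
```

Keeping the merged pool small (at most 130 candidates per post) is what makes a relatively expensive LLM-based cross-encoder affordable as the final stage.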
Problem

Research questions and friction points this paper is trying to address.

Retrieving scientific papers from social media mentions
Bridging informal-to-formal language gap in retrieval
Combining lexical and semantic methods for robust results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid retrieval combining lexical and semantic methods
Fine-tuned INF-Retriever-v1 for dense semantic retrieval
LLM-based cross-encoder for final re-ranking
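The reported metric, MRR@5, rewards placing the cited paper near the top of the first five returned results. A minimal sketch of the computation:

```python
def mrr_at_5(ranked_lists, gold_ids):
    """Mean reciprocal rank truncated at depth 5: for each query, score
    1/rank of the gold paper if it appears in the top 5, else 0."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        for rank, pid in enumerate(ranked[:5], start=1):
            if pid == gold:
                total += 1.0 / rank
                break
    return total / len(gold_ids)

# Query 1: gold at rank 2 (score 0.5); query 2: gold missing (score 0).
print(mrr_at_5([["a", "b", "c"], ["x", "y"]], ["b", "z"]))  # 0.25
```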