🤖 AI Summary
This study addresses performance bottlenecks of retrieval-augmented generation (RAG) systems on dynamic test sets, specifically low answer relevance and faithfulness. It proposes a hybrid retrieval framework that integrates BM25 (sparse) and E5 (dense) retrievers, augmented by RankLLaMA neural re-ranking and DSPy-based prompt optimization. Key findings include: (i) vocabulary alignment between questions and documents is the strongest predictor of answer quality on the development set; and (ii) prompt optimization improves semantic similarity but risks inducing model overconfidence (0% refusal rates), requiring careful trade-off design. Experiments show that neural re-ranking improves mean average precision (MAP) by 52% relative but raises per-question latency from 1.74s to 84s, so the submitted system omits re-ranking; that system ranks 4th in faithfulness and 11th in correctness among 25 competing teams. Overall, this work provides a reproducible methodology for improving the robustness and trustworthiness of RAG systems in dynamic settings.
📝 Abstract
We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval and generates answers with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive computational costs (84s vs 1.74s per question). While DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs 0.668), their 0% refusal rates raised concerns about overconfidence and generalizability. Our submitted hybrid system without re-ranking achieved 4th place in faithfulness and 11th place in correctness among 25 teams. Analysis across question categories reveals that vocabulary alignment between questions and documents was the strongest predictor of performance on our development set, with document-similar phrasing improving cosine similarity from 0.562 to 0.762.
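To make the hybrid retrieval idea concrete, here is a minimal sketch of combining BM25 and dense-retriever scores for the same candidate pool. The abstract does not specify the fusion method, so the min-max normalization, the `alpha` weight, and the function names are illustrative assumptions, not the paper's actual implementation:

```python
def minmax(scores):
    # Rescale raw scores to [0, 1] so sparse (BM25) and dense (E5)
    # scores become comparable; a constant score list maps to all zeros.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_fuse(sparse_scores, dense_scores, alpha=0.5):
    """Weighted sum of normalized sparse and dense scores per document.

    `alpha` weights the dense side; documents missing from one
    retriever's result list contribute 0 for that component.
    (Hypothetical fusion scheme for illustration.)
    """
    s, d = minmax(sparse_scores), minmax(dense_scores)
    fused = {doc: (1 - alpha) * s.get(doc, 0.0) + alpha * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    return sorted(fused, key=fused.get, reverse=True)

# Toy example: doc IDs and scores are made up.
bm25 = {"d1": 12.0, "d2": 7.5, "d3": 3.1}    # raw BM25 scores
dense = {"d2": 0.82, "d3": 0.80, "d4": 0.65}  # cosine similarities
ranking = hybrid_fuse(bm25, dense)            # → ["d2", "d1", "d3", "d4"]
```

Note how "d2" wins overall despite topping neither list alone: it scores moderately in both retrievers, which is exactly the complementarity a hybrid system exploits.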