🤖 AI Summary
This work addresses the tendency of existing benchmarks to overestimate the retrieval and reasoning capabilities of Long-Context Language Models (LCLMs) in realistic RAG scenarios. We formally define In-Context Retrieval and Reasoning (ICR²) and introduce the ICR² benchmark, which makes contexts more realistic by including confounding passages retrieved with strong retrievers. We then propose three methods to improve LCLM performance: (1) retrieve-then-generate fine-tuning; (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding; and (3) joint training of a retrieval head alongside the generation head. Applied to Mistral-7B, the best approach improves Exact Match (EM) by +17 and +15 points on LOFT and by +13 and +2 points on ICR², compared to vanilla RAG and supervised fine-tuning, respectively. Despite being a much smaller model, it also outperforms GPT-4-Turbo on most tasks.
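The gains above are reported in Exact Match points. As a rough illustration of how such a score is commonly computed, here is a minimal Python sketch. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the benchmarks discussed here may use a different EM variant, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not necessarily what LOFT or ICR² use.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match at least one reference."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this reading, a "+17 point" gain is the difference between two `em_score` values computed on the same evaluation set for two systems.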
📝 Abstract
Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
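The abstract describes retrieval-attention-probing only at a high level. The following Python sketch shows one way attention weights could be used to filter passages in a long context. Everything concrete here is an assumption for illustration: the input layout, the choice to average over heads, the top-k cutoff, and the function names are hypothetical and are not taken from the paper.

```python
def passage_attention_scores(
    attn: list[list[float]], spans: list[tuple[int, int]]
) -> list[float]:
    """Score each passage by the attention mass it receives.

    attn:  one row per probed attention head; each row holds that head's
           attention weights over the context tokens (assumed already
           aggregated over the query tokens).
    spans: (start, end) token index range of each passage, end exclusive.
    """
    scores = []
    for start, end in spans:
        per_head_mass = [sum(head[start:end]) for head in attn]
        scores.append(sum(per_head_mass) / len(per_head_mass))
    return scores


def filter_passages_by_attention(
    attn: list[list[float]], spans: list[tuple[int, int]], top_k: int
) -> list[int]:
    """Return indices of the top_k highest-scoring passages, in context order."""
    scores = passage_attention_scores(attn, spans)
    ranked = sorted(range(len(spans)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy example: two heads, six context tokens, three two-token passages.
attn = [
    [0.10, 0.10, 0.40, 0.30, 0.05, 0.05],
    [0.00, 0.20, 0.50, 0.20, 0.10, 0.00],
]
spans = [(0, 2), (2, 4), (4, 6)]
kept = filter_passages_by_attention(attn, spans, top_k=2)
```

In this toy input the middle passage receives most of the attention mass, so it is kept first; passages outside the top k would be dropped from the context before generation continues.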