Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

📅 2025-01-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of existing benchmarks, which provide overly simplified contexts, to overestimate how well Long-Context Language Models (LCLMs) retrieve and reason in realistic RAG scenarios. It defines In-Context Retrieval and Reasoning (ICR^2) and introduces the ICR^2 benchmark, which makes evaluation more realistic by including confounding passages retrieved with strong retrievers. Three methods are proposed to improve LCLM performance: (1) retrieve-then-generate fine-tuning; (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding; and (3) joint training of a retrieval head alongside the generation head. Across five LCLMs evaluated on LOFT and ICR^2, the best approach applied to Mistral-7B improves Exact Match (EM) by +17 and +15 points on LOFT and by +13 and +2 points on ICR^2, relative to vanilla RAG and supervised fine-tuning, respectively. It also outperforms GPT-4-Turbo on most tasks despite being a much smaller model.

📝 Abstract

Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
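The gains above are reported in Exact Match. As a rough illustration of how such a score is typically computed, the sketch below uses the common SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). This normalization is an assumption; the abstract does not say which variant the authors use, and the function names `normalize_answer`, `exact_match`, and `em_score` are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization, not confirmed by the source.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this reading, a "+17 points" improvement means the percentage returned by a function like `em_score` rises by 17 on the same evaluation set.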

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Information Retrieval

Complex Text Reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

ICR^2 evaluation framework

joint training for retrieval and generation

attention mechanism for information filtering
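
The attention-based filtering idea can be pictured with a small sketch. It is not the paper's implementation: it assumes attention weights from the current decoding position to the context tokens are already available as an array, that each passage occupies a known token span, and that a set of "retrieval" heads has been chosen beforehand. The names `score_passages`, `top_k_passages`, and `head_ids` are hypothetical.

```python
import numpy as np


def score_passages(attn: np.ndarray, spans: list[tuple[int, int]],
                   head_ids: list[int]) -> np.ndarray:
    """Score each passage by the attention mass it receives.

    attn:     (num_heads, context_len) attention weights over context tokens.
    spans:    half-open (start, end) token ranges, one per passage.
    head_ids: indices of the heads assumed to be useful for retrieval.
    """
    per_token = attn[head_ids].mean(axis=0)  # average over the selected heads
    return np.array([per_token[start:end].sum() for start, end in spans])


def top_k_passages(attn: np.ndarray, spans: list[tuple[int, int]],
                   head_ids: list[int], k: int) -> list[int]:
    """Return indices of the k highest-scoring passages, in document order."""
    scores = score_passages(attn, spans, head_ids)
    keep = np.argsort(-scores)[:k]
    return sorted(keep.tolist())


# Toy example: three two-token passages, two selected heads out of three.
attn = np.array([
    [0.1, 0.1, 0.6, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
])
spans = [(0, 2), (2, 4), (4, 6)]
kept = top_k_passages(attn, spans, head_ids=[0, 1], k=1)
```

Summing attention within a span favors longer passages. Averaging per token would be an equally plausible choice, and the abstract does not specify which aggregation the method uses.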
