Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weak generalization and susceptibility to distractors plague large language models (LLMs) in long-chain reasoning tasks. To address this, we propose MemReasoner, a lightweight, explicit memory-augmented reasoning architecture. Its core innovations are (i) relative-order modeling and hop-based memory access for structured factual representation, and (ii) selective memory attention integrated with end-to-end joint training. MemReasoner achieves strong generalization with only 0–1% fact-level supervision. On our novel multi-hop "reasoning-in-a-haystack" benchmark, designed to stress-test robustness against long distractor passages and answer perturbations, it outperforms fully supervised baselines by over 40 percentage points in both single- and two-hop accuracy, and it remains highly robust under adversarial interference. Our analysis reveals a key mechanism: the synergistic interplay between weak supervision and structured memory significantly enhances reasoning robustness and generalization.

📝 Abstract
Large language models often expose their brittleness in reasoning tasks, especially when executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture in which the memory learns the relative order of facts in context and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting-fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments under a variety of challenging scenarios, including long distractor text and target-answer changes in the test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization is achieved with none-to-weak supporting-fact supervision (none and 1% of supporting facts for the one- and two-hop tasks, respectively). In contrast, baseline models generally struggle to generalize and benefit far less from full supporting-fact supervision. These results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language models' ability to process context for reasoning tasks.
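The abstract's core loop, a memory of fact encodings over which a query can hop while the decoder selectively attends to the readout, can be sketched roughly as below. This is a minimal illustration of soft attention with iterative query updates, not the paper's actual implementation; all function names and the residual query update are assumptions for the sketch.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, memory):
    """One hop: soft attention over stored fact encodings,
    returning a weighted readout vector (selective memory attention)."""
    weights = softmax([dot(query, m) for m in memory])
    dim = len(memory[0])
    return [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)]

def multi_hop_read(query, memory, hops=2):
    """Iterative memory access: each hop's readout updates the query,
    letting the model 'hop' from one supporting fact to the next.
    The residual update below is an illustrative choice."""
    q = query
    for _ in range(hops):
        readout = attend(q, memory)
        q = [qi + ri for qi, ri in zip(q, readout)]
    return q
```

In the actual architecture the memory additionally encodes the relative order of facts and the decoder conditions on the final readout; here the hop count and update rule are fixed by hand purely to show the control flow.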
Problem

Research questions and friction points this paper is trying to address.

Enhance reasoning in large language models using memory-augmented architecture.
Improve generalization in multi-hop reasoning tasks with minimal supervision.
Address brittleness in long-context reasoning tasks with selective memory attention.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-augmented LLM architecture for reasoning tasks
End-to-end training with optional fact supervision
Generalization using none-to-weak supporting fact supervision
Payel Das
Manager and Principal Research Staff Member, AI research, IBM Watson, NY
trustworthy ML, generative AI, bio-inspired AI, AI4Science, statistical physics
Ching-Yun Ko
IBM AI Research
Sihui Dai
Princeton University (Work done during internship at IBM Research)
Georgios Kollias
IBM AI Research
Subhajit Chaudhury
Senior Research Scientist, IBM Research
Neuro-Symbolic AI, Reinforcement Learning, Trustworthy AI, Computer Vision
Aurélie C. Lozano
IBM AI Research