Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies that causal language models (CLMs) exhibit high sensitivity to document ordering in multi-hop question answering (MHQA) due to their inherent causal masking, which limits cross-document reasoning. To diagnose this limitation, the authors conduct systematic context-order permutation experiments, complemented by attention-weight analysis and disentanglement of the masking mechanism. They find that encoder-decoder architectures (e.g., Flan-T5) are intrinsically more robust for multi-hop inference. Building on this insight, they propose a chain-of-reasoning–guided document reordering strategy and introduce a bidirectional attention mask to enhance the cross-document modeling capability of CLMs (e.g., Llama). Experiments demonstrate substantial accuracy improvements for CLMs on MHQA benchmarks; moreover, attention concentration correlates positively with answer correctness. The codebase and analytical framework are publicly released.
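The permutation experiments described above amount to evaluating the same question under every ordering of the retrieved documents. A minimal sketch of how such prompts could be generated (the prompt template and function name are illustrative assumptions, not the paper's actual code):

```python
from itertools import permutations

def build_permuted_prompts(question, documents):
    """For each ordering of the retrieved documents, build one prompt,
    so an LM's sensitivity to document position can be measured.
    The prompt template here is a hypothetical placeholder."""
    prompts = []
    for order in permutations(range(len(documents))):
        context = "\n\n".join(documents[i] for i in order)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        prompts.append((order, prompt))
    return prompts
```

Scoring each prompt with the same model and comparing accuracy across orderings then reveals the position sensitivity the paper analyzes.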

📝 Abstract
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also with employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
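Finding 3 in the abstract, modifying the causal mask so the model attends bidirectionally, can be sketched as a mask that stays causal everywhere except over the retrieved-document span. This is a simplified illustration under that assumption, not the paper's actual implementation; span boundaries and the function name are hypothetical:

```python
import numpy as np

def hybrid_attention_mask(seq_len, ctx_start, ctx_end):
    """Build an attention mask (1 = may attend, 0 = masked) that is
    causal overall, but lets tokens in the context span
    [ctx_start, ctx_end) also attend to later tokens within that span,
    i.e., bidirectional attention over the retrieved documents."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=np.int8))  # causal base
    mask[ctx_start:ctx_end, ctx_start:ctx_end] = 1  # bidirectional context
    return mask
```

Tokens before and after the context span keep strictly causal attention, so generation of the answer itself remains left-to-right.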
Problem

Research questions and friction points this paper is trying to address.

Analyzing LM performance in multi-hop QA with context permutation
Exploring impact of document order on reasoning chain effectiveness
Improving decoder-only models via bidirectional attention modification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-decoder models outperform decoder-only LMs
Document order alignment boosts reasoning performance
Bi-directional attention enhances causal decoder-only models
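The paper's observation that attention weights peak higher when the answer is correct is used heuristically. One way such a heuristic could look, preferring the candidate whose attention distribution peaks highest, is sketched below (the data layout and function name are assumptions for illustration, not the authors' code):

```python
def select_by_attention_peak(candidates):
    """Given (answer, attention_weights) pairs, return the answer whose
    attention distribution has the highest peak value, following the
    observed correlation between attention concentration and
    correctness. `candidates` is a hypothetical list of
    (answer_string, list_of_attention_weights)."""
    best_answer, _ = max(candidates, key=lambda c: max(c[1]))
    return best_answer
```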