🤖 AI Summary
This work addresses the significant performance degradation of existing retrieval-augmented generation (RAG) systems on multilingual multi-hop question answering, which stems from two gaps: the absence of suitable benchmarks and an overreliance on English-centric semantic understanding. To bridge these gaps, the authors construct multilingual multi-hop QA benchmarks (English benchmarks translated into five languages) and propose the DaPT framework, which generates sub-question graphs in parallel for the source-language query and its English translation, merges them, and then applies a bilingual retrieval-and-answer strategy to solve the sub-questions sequentially. By combining multilingual sub-question decomposition with bilingual retrieval-augmented generation, the method yields answers that are both more accurate and more concise, achieving an average 18.3% relative improvement in Exact Match (EM) over the strongest baseline on the most challenging benchmark, MuSiQue.
📝 Abstract
Retrieval-augmented generation (RAG) systems have made significant progress on complex multi-hop question answering (QA) in English. In practice, however, RAG systems must retrieve across multilingual corpora and queries, which raises several open challenges. The first is the absence of benchmarks that assess RAG systems under the multilingual multi-hop (MM-hop) QA setting. The second is an overreliance on LLMs' strong semantic understanding of English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for the source-language query and its English translation, merges them, and then employs a bilingual retrieval-and-answer strategy to solve the sub-questions sequentially. Our experiments show that advanced RAG systems suffer a significant performance imbalance across languages, and that our method consistently yields more accurate and more concise answers than the baselines, substantially improving RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.
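To make the described pipeline concrete, here is a minimal, hypothetical sketch of a DaPT-style bilingual decompose-retrieve-answer loop. All function names (`decompose`, `merge_graphs`, `retrieve`, `answer`, `dapt_pipeline`) and the stubbed logic are illustrative assumptions, not the paper's implementation; real deployments would replace the stubs with LLM calls and a dense retriever.

```python
# Hypothetical sketch of a DaPT-style bilingual pipeline.
# Every function below is an illustrative stand-in, not the paper's code.

def decompose(query: str) -> list[str]:
    # Stand-in for LLM-based sub-question graph generation;
    # fakes a fixed 2-hop decomposition for demonstration.
    return [f"[sub1 of: {query}]", f"[sub2 of: {query}]"]

def merge_graphs(src_subs: list[str], en_subs: list[str]) -> list[tuple[str, str]]:
    # Pair source-language and English sub-questions hop by hop.
    return list(zip(src_subs, en_subs))

def retrieve(question: str, lang: str) -> list[str]:
    # Stand-in for retrieval over a corpus in the given language.
    return [f"doc({lang}): {question}"]

def answer(question: str, contexts: list[str]) -> str:
    # Stand-in for LLM answer generation over retrieved contexts.
    return f"ans({question})"

def dapt_pipeline(src_query: str, en_query: str) -> str:
    src_subs = decompose(src_query)   # graph in the source language
    en_subs = decompose(en_query)     # graph in English
    merged = merge_graphs(src_subs, en_subs)
    partial = ""
    for src_q, en_q in merged:
        # Bilingual retrieval: pool evidence from both languages,
        # then solve the sub-questions sequentially.
        contexts = retrieve(src_q, "src") + retrieve(en_q, "en")
        partial = answer(en_q, contexts)
    return partial
```

In a real system, each solved sub-question's answer would also be substituted into later sub-questions before they are retrieved and answered; the sketch omits that propagation step for brevity.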