🤖 AI Summary
This work addresses question answering over semi-structured data—comprising both textual and relational information—where large language models (LLMs) suffer from hallucinations due to knowledge lag, and existing retrieval-augmented generation (RAG) methods struggle to model heterogeneous relational structures effectively. We propose a planning-guided LLM agent framework: it first generates interpretable information retrieval plans, then orchestrates graph/table relation parsers and multi-hop retrieval agents to jointly model and dynamically navigate cross-modal textual and relational information. This introduces the novel “planning-based RAG” paradigm, overcoming limitations of unimodal retrieval. Evaluated on cross-domain semi-structured benchmarks, our approach reduces hallucination rates by 32% and improves answer accuracy by 27%, demonstrating enhanced controllability, interpretability, and generalization across diverse data schemas.
📝 Abstract
Large language models (LLMs) have shown impressive abilities in answering questions across various domains, but they often encounter hallucination issues on questions that require professional and up-to-date knowledge. To address this limitation, retrieval-augmented generation (RAG) techniques have been proposed, which retrieve relevant information from external sources to inform their responses. However, existing RAG methods typically focus on a single type of external data, such as vectorized text database or knowledge graphs, and cannot well handle real-world questions on semi-structured data containing both text and relational information. To bridge this gap, we introduce PASemiQA, a novel approach that jointly leverages text and relational information in semi-structured data to answer questions. PASemiQA first generates a plan to identify relevant text and relational information to answer the question in semi-structured data, and then uses an LLM agent to traverse the semi-structured data and extract necessary information. Our empirical results demonstrate the effectiveness of PASemiQA across different semi-structured datasets from various domains, showcasing its potential to improve the accuracy and reliability of question answering systems on semi-structured data.