Question-to-Question Retrieval for Hallucination-Free Knowledge Access: An Approach for Wikipedia and Wikidata Question Answering

📅 2025-01-20

🤖 AI Summary
LLM-based question answering over large-scale knowledge bases (e.g., Wikipedia/Wikidata) suffers from hallucination and inefficiency. Method: We propose a “question-question matching” retrieval paradigm: instruction-tuned LLMs (e.g., Llama-3) generate multi-perspective questions for each knowledge unit; these questions are embedded into a dense vector space using Sentence-BERT or ColBERT. User queries are matched directly against the precomputed question index, enabling zero-shot, generation-free, semantically aligned knowledge access. Crucially, this approach replaces document-level retrieval with question-level retrieval and integrates Wikidata’s RDF schema for structured fact mapping. Contributions/Results: Experiments on Wikipedia and Wikidata report cosine similarity above 0.9 for relevant question pairs and sub-100ms latency, and extend to multimodal (text + multimedia) QA through structured fact retrieval from Wikidata. Because answers are retrieved rather than generated, the method avoids generation-time hallucination while improving scalability, reliability, and retrieval precision.

📝 Abstract
This paper introduces an approach to question answering over knowledge bases like Wikipedia and Wikidata by performing "question-to-question" matching and retrieval from a dense vector embedding store. Instead of embedding document content, we generate a comprehensive set of questions for each logical content unit using an instruction-tuned LLM. These questions are vector-embedded and stored, each mapping to its corresponding content. Vector embeddings of user queries are then matched against this question vector store. The stored question with the highest similarity score leads to direct retrieval of the associated article content, eliminating the need for answer generation. Our method achieves high cosine similarity (>0.9) for relevant question pairs, enabling highly precise retrieval. This approach offers several advantages, including computational efficiency, rapid response times, and increased scalability. We demonstrate its effectiveness on Wikipedia and Wikidata, including multimedia content through structured fact retrieval from Wikidata, opening up new pathways for multimodal question answering.
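
The retrieval loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the content units and their questions are hand-written stand-ins for what an instruction-tuned LLM would generate, and `embed` is a toy bag-of-words vectorizer used in place of a real dense embedding model so that the example runs with the standard library alone. All names (`CONTENT`, `GENERATED_QUESTIONS`, `build_index`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical content units (stand-ins for logical units of an article).
CONTENT = {
    "eiffel": "The Eiffel Tower is a wrought-iron lattice tower in Paris, completed in 1889.",
    "python": "Python is a high-level programming language first released in 1991.",
}

# Hand-written stand-ins for questions an instruction-tuned LLM would generate.
GENERATED_QUESTIONS = {
    "eiffel": ["Where is the Eiffel Tower located?",
               "When was the Eiffel Tower completed?"],
    "python": ["What is Python?",
               "When was Python first released?"],
}


def embed(text):
    """Toy bag-of-words vector (assumption: replaces a dense embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(questions):
    """Embed every stored question once and keep a pointer to its content unit."""
    return [(embed(q), q, unit_id)
            for unit_id, unit_questions in questions.items()
            for q in unit_questions]


def retrieve(query, index, content):
    """Return (score, matched question, content) for the most similar stored question."""
    query_vec = embed(query)
    scored = ((cosine(query_vec, vec), q, unit_id) for vec, q, unit_id in index)
    score, question, unit_id = max(scored, key=lambda item: item[0])
    # The content is returned as stored; no answer text is generated.
    return score, question, content[unit_id]


INDEX = build_index(GENERATED_QUESTIONS)
score, matched_question, answer = retrieve("Where is the Eiffel Tower?", INDEX, CONTENT)
```

The user query is compared only against stored questions, never against document text, and the result is the stored content itself. In a real system the similarity score could also be thresholded (the abstract reports values above 0.9 for relevant pairs) to decide when no stored question is a good enough match.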

Problem

Research questions and friction points this paper is trying to address.

Large-scale Knowledge Bases
Accuracy
Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Answer Retrieval
Semantic Question Generation
Multimedia Dataset Compatibility