🤖 AI Summary
This work addresses a limitation of existing Wikidata question-answering datasets, which predominantly rely on template-based generation and fail to capture the complexity of real user queries. To overcome this, the authors propose an agent-driven iterative framework that leverages authentic SPARQL query logs from the Wikidata Query Service. Through a pipeline of de-anonymization, cleaning, semantic validation, and natural-language generation, the framework automatically constructs high-quality question-query pairs without predefined templates, improving data authenticity and diversity. The resulting dataset, WDQL, comprises 200,000 question-query pairs—more than six times larger than comparable existing resources—and yields substantial improvements in downstream question-answering performance. All data and code are publicly released to support further research.
📝 Abstract
We present the Wikidata Query Logs (WDQL) dataset, a collection of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata dataset of similar format, yet does not rely on template-generated queries. Instead, we construct it from real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, significant effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.
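The abstract does not spell out how a repaired query is checked, but the verification step it describes can be sketched as executing each candidate query against the public Wikidata SPARQL endpoint and keeping it only if it returns results. The sketch below is illustrative, not the authors' implementation: the function names (`run_query`, `is_meaningful`) and the retry-free, single-shot design are assumptions; only the endpoint URL and the standard SPARQL JSON results format are given facts.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def run_query(sparql: str, timeout: int = 30) -> dict:
    """Send a SPARQL query to WDQS and return the parsed JSON result.

    A descriptive User-Agent is required by Wikimedia's API etiquette.
    """
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": sparql, "format": "json"}
    )
    req = urllib.request.Request(
        url, headers={"User-Agent": "wdql-verification-sketch/0.1"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def is_meaningful(result: dict) -> bool:
    """A de-anonymized query counts as verified only if it yields
    at least one variable binding in the standard SPARQL JSON layout."""
    return bool(result.get("results", {}).get("bindings"))
```

In an agent loop, a query failing `is_meaningful` would be sent back for another round of de-anonymization and cleaning rather than discarded outright.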