The Wikidata Query Logs Dataset
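AI Summary
The paper presents WDQL, a large-scale dataset of question-query pairs over the Wikidata knowledge graph, intended for training question-answering systems.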
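Main Contributions
- Builds WDQL, a Wikidata dataset of 200k question-query pairs.
- Proposes an agent-based method that generates natural-language questions from anonymized SPARQL queries.
- Demonstrates the dataset's benefit for training question-answering methods.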
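Methodology
An agent-based method iteratively de-anonymizes, cleans, and verifies SPARQL queries against Wikidata, and generates a corresponding natural-language question for each query.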
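The iterative loop described above can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the paper's agent: the names `refine`, `Candidate`, `propose`, `execute`, and `ask` are hypothetical, and the three callables stand in for whatever model calls and Wikidata access the real system uses. The only idea taken from the summary is the cycle of revising a query, checking it against Wikidata, and writing a question once the query returns results.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """A verified query together with its generated question."""
    sparql: str
    question: str


def refine(
    anonymized_sparql: str,
    propose: Callable[[str, str], str],      # hypothetical: (query, feedback) -> revised query
    execute: Callable[[str], List[dict]],    # hypothetical: runs the query, returns result rows
    ask: Callable[[str, List[dict]], str],   # hypothetical: writes a question for a verified query
    max_rounds: int = 5,                     # assumed budget, not a value from the paper
) -> Optional[Candidate]:
    """Revise a query until it returns results, then attach a question."""
    query = anonymized_sparql
    feedback = "initial anonymized query"
    for _ in range(max_rounds):
        # De-anonymize / clean: ask for a revision given the latest feedback.
        query = propose(query, feedback)
        # Verify: run the revised query and inspect the outcome.
        try:
            rows = execute(query)
        except Exception as err:
            feedback = f"execution failed: {err}"
            continue
        if rows:
            return Candidate(sparql=query, question=ask(query, rows))
        feedback = "query returned no results"
    # Give up when the budget is exhausted without a verified query.
    return None
```

Passing the three steps in as callables keeps the sketch independent of any particular language model or SPARQL client, and queries that never produce results are simply dropped by returning `None`.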
Original Abstract
We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.