LLM Reasoning relevance: 7/10

The Wikidata Query Logs Dataset

Sebastian Walter, Hannah Bast
arXiv: 2602.14594v1 Published: 2026-02-16 Updated: 2026-02-16

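AI Summary

The paper presents WDQL, a large-scale dataset of question-query pairs over Wikidata, intended for training question-answering systems.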

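Main Contributions

  • Builds WDQL, a dataset of 200k question-query pairs over the Wikidata knowledge graph.
  • Proposes an agent-based method that generates natural-language questions from anonymized SPARQL queries.
  • Demonstrates the dataset's benefit for training question-answering methods.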

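Methodology

An agent-based method iteratively de-anonymizes, cleans, and verifies SPARQL queries from the Wikidata Query Service logs against Wikidata, and generates a corresponding natural-language question for each query.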
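
The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' agent: the function `refine_query`, the `Pair` type, the `max_attempts` parameter, the "PLACEHOLDER" token, and the three toy callbacks are all hypothetical names introduced here. In a real system, `run_query` would presumably execute the query against a Wikidata SPARQL endpoint, and `propose_fix` and `write_question` would be backed by a language model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pair:
    """One question-query pair (hypothetical container for this sketch)."""
    question: str
    sparql: str


def refine_query(
    anonymized: str,
    propose_fix: Callable[[str], Optional[str]],
    run_query: Callable[[str], list],
    write_question: Callable[[str], str],
    max_attempts: int = 5,
) -> Optional[Pair]:
    """Repair a query until it returns results, then attach a question.

    Each attempt verifies the current query; if it yields no results,
    the next candidate fix is tried. Gives up after max_attempts
    verifications or when no further fix is proposed.
    """
    query = anonymized
    for _ in range(max_attempts):
        if run_query(query):  # verification: a non-empty result set
            return Pair(write_question(query), query)
        fixed = propose_fix(query)
        if fixed is None:  # nothing left to try
            return None
        query = fixed
    return None


# Toy stand-ins so the sketch runs without network access.
# "PLACEHOLDER" is an invented marker for an anonymized literal.
def toy_run(query: str) -> list:
    return [] if "PLACEHOLDER" in query else ["Q42"]


def toy_fix(query: str) -> Optional[str]:
    if "PLACEHOLDER" not in query:
        return None
    return query.replace("PLACEHOLDER", "Douglas Adams")


def toy_question(query: str) -> str:
    return "Which entity has the English label 'Douglas Adams'?"


if __name__ == "__main__":
    raw = 'SELECT ?x WHERE { ?x rdfs:label "PLACEHOLDER"@en }'
    print(refine_query(raw, toy_fix, toy_run, toy_question))
```

Passing the three steps in as callables keeps the control flow separate from whatever endpoint and model are used, so the loop can be exercised offline with stubs like the ones above.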

Original Abstract

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.

Tags

Wikidata, question answering, SPARQL, dataset

arXiv Categories

cs.CL