LLM Memory & RAG relevance: 8/10

WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović, Michael Granitzer
arXiv: 2602.17327v1 · Published: 2026-02-19 · Updated: 2026-02-19

AI Summary

WebFAQ 2.0 is released as an expanded multilingual FAQ question-answering dataset, accompanied by mined hard negatives for training dense retrieval models.

Key Contributions

  • Built a large-scale multilingual FAQ question-answering dataset
  • Released a hard-negatives dataset with cross-encoder scores
  • Validated the effectiveness of the hard negatives for training dense retrieval models

Methodology

FAQs are crawled from the web; hard negatives are mined with a two-stage retrieval pipeline; dense retrieval models are then trained via contrastive learning and knowledge distillation.
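The two-stage mining described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the embeddings, corpus, and token-overlap "cross-encoder" scorer are all hypothetical stand-ins (a real pipeline would use a trained bi-encoder for stage 1 and a neural cross-encoder for stage 2).

```python
def dot(a, b):
    """Dot product between two toy embedding vectors."""
    return sum(x * y for x, y in zip(a, b))

def first_stage_retrieve(query_vec, corpus_vecs, k):
    """Stage 1: cheap dense retrieval -- rank the corpus by dot product."""
    scored = sorted(corpus_vecs.items(), key=lambda kv: dot(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

def cross_encoder_score(query_text, doc_text):
    """Stage 2 stand-in: a real pipeline calls a cross-encoder here.
    This toy scorer uses token-overlap (Jaccard) similarity instead."""
    q, d = set(query_text.split()), set(doc_text.split())
    return len(q & d) / max(len(q | d), 1)

def mine_hard_negatives(query_text, query_vec, corpus_vecs, corpus_texts,
                        positive_id, k):
    """Retrieve top candidates, drop the known positive, and attach
    cross-encoder scores to the remaining (hard negative) documents."""
    candidates = first_stage_retrieve(query_vec, corpus_vecs, k + 1)
    return [(doc_id, cross_encoder_score(query_text, corpus_texts[doc_id]))
            for doc_id in candidates if doc_id != positive_id]

# Tiny demo corpus (hypothetical data)
corpus_vecs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
corpus_texts = {"d1": "how to reset a password",
                "d2": "reset password for email account",
                "d3": "shipping times and delivery"}
negs = mine_hard_negatives("how do I reset my password", [1.0, 0.0],
                           corpus_vecs, corpus_texts, positive_id="d1", k=2)
print(negs)  # d2 is a lexically close (hard) negative; d3 is unrelated
```

In the paper's setting, the same idea is scaled up to 200 scored negatives per query; the scores are what make the MarginMSE distillation recipe possible.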

Original Abstract

We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: Contrastive Learning with MultipleNegativesRanking loss, and Knowledge Distillation with MarginMSE loss. WebFAQ 2.0 is not a static resource but part of a long-term effort. Since late 2025, structured FAQs are being regularly released through the Open Web Index, enabling continuous expansion and refinement. We publish the datasets and training scripts to facilitate further research in multilingual and cross-lingual IR. The dataset itself and all related resources are publicly available on GitHub and HuggingFace.
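The two fine-tuning strategies named in the abstract can be sketched with their per-query loss terms. This is a simplified, single-query sketch with made-up scores: MultipleNegativesRanking reduces to a softmax cross-entropy where the positive must outscore the negatives, and MarginMSE matches the student's score margin to the cross-encoder teacher's margin.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mnr_loss(sim_pos, sims_neg):
    """MultipleNegativesRanking (InfoNCE-style) loss for one query:
    cross-entropy with the positive's similarity in slot 0."""
    probs = softmax([sim_pos] + sims_neg)
    return -math.log(probs[0])

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """MarginMSE distillation loss for one (query, pos, neg) triple:
    squared difference between student and teacher score margins."""
    return ((student_pos - student_neg) - (teacher_pos - teacher_neg)) ** 2

# Hypothetical similarity scores for one query
print(mnr_loss(0.9, [0.2, 0.1]))            # low when the positive dominates
print(margin_mse_loss(0.8, 0.3, 0.9, 0.2))  # near zero when margins agree
```

The cross-encoder scores shipped with the hard-negatives dataset supply the teacher margins for the MarginMSE recipe; MNR only needs the positive and the mined negatives.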

Tags

Question Answering · Multilingual · Information Retrieval · Hard Negative Mining · Dense Retrieval

arXiv Categories

cs.IR cs.AI cs.CL