LLM Memory & RAG Relevance: 9/10

RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish

Süha Kağan Köse, Mehmet Can Baytekin, Burak Aktaş, Bilge Kaan Görür, Evren Ayberk Munis, Deniz Yılmaz, Muhammed Yusuf Kartal, Çağrı Toraman
arXiv: 2602.03652v1 Published: 2026-02-03 Updated: 2026-02-03

AI Summary

This paper builds a Turkish RAG dataset and evaluates different RAG pipelines on it in order to optimize Turkish RAG systems.

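Key Contributions

  • Builds a Turkish RAG dataset from Turkish Wikipedia and CulturaX
  • Evaluates different RAG pipelines on Turkish
  • Proposes optimizations specific to Turkish RAG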

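Methodology

The authors build a Turkish dataset, benchmark each stage of the RAG pipeline, compare the methods available at each stage, and search for a Pareto-optimal configuration.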
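
To illustrate what "Pareto-optimal" means here, the following is a minimal sketch of selecting non-dominated configurations on an accuracy/cost plane. It is not the authors' code. The three accuracy figures for the baseline, HyDE, and reranking configurations come from the abstract; every cost value, the `Config` type, the `pareto_front` helper, and the entire "over-stacked" row are hypothetical and chosen only to make the example concrete.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # percent, higher is better
    cost: float      # relative cost units, lower is better (HYPOTHETICAL values)


def pareto_front(configs):
    """Return the configs that no other config dominates.

    A config `o` dominates `c` when it is at least as good on both axes
    (accuracy >= and cost <=) and strictly better on at least one.
    """
    front = []
    for c in configs:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.cost <= c.cost
            and (o.accuracy > c.accuracy or o.cost < c.cost)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


configs = [
    # Accuracies below are from the abstract; costs are made up for illustration.
    Config("baseline", 78.70, 1.0),
    Config("hyde", 85.00, 6.0),
    Config("rerank+context-aug", 84.60, 2.0),
    # Entirely hypothetical row standing in for an over-stacked pipeline.
    Config("over-stacked (hypothetical)", 82.00, 8.0),
]

front = pareto_front(configs)
for c in sorted(front, key=lambda c: c.cost):
    print(f"{c.name}: accuracy={c.accuracy:.2f}%, cost={c.cost}")
```

Under these assumed costs, the hypothetical over-stacked row drops out because cheaper configurations reach higher accuracy, while the other three remain on the front. Which point on the front to deploy is then a budget decision rather than a purely accuracy-driven one.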

Original Abstract

Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%) that is considerably higher than the baseline (78.70%). Also a Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) with much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.

Tags

RAG, Turkish, Retrieval-Augmented Generation, Multilingual

arXiv Categories

cs.CL cs.AI cs.IR