To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
AI Summary
Studies the trade-off between pretraining data volume and retrieval data volume, providing guidance on data allocation for RAG systems.
Key Contributions
- Proposes a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size.
- Quantifies the performance gains from retrieval and analyzes how their marginal utility depends on model scale and task type.
- Offers practical guidance on the optimal allocation between pretraining and retrieval under a fixed data budget.
Methodology
Trains OLMo-2 models at multiple scales while controlling pretraining data volume and retrieval corpus size, then evaluates the models on a suite of benchmarks.
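The fixed-budget allocation question can be sketched numerically. The sketch below assumes a Chinchilla-style additive power law; the functional form and every constant in it are illustrative assumptions, not the paper's fitted values. A grid search then recovers the loss-minimizing split of a fixed token budget between pretraining data and the retrieval store:

```python
import numpy as np

def predicted_loss(N, D, R,
                   E=1.7, a=400.0, alpha=0.34,
                   b=1e3, beta=0.28, c=50.0, gamma=0.30):
    """Hypothetical loss surface over model parameters N, pretraining
    tokens D, and retrieval-corpus tokens R. The additive power-law form
    and all constants are illustrative assumptions."""
    return E + a / N**alpha + b / D**beta + c / R**gamma

def best_split(N, budget, grid=999):
    """Grid-search the fraction of a fixed token budget spent on
    pretraining data; the remainder goes to the retrieval store."""
    fracs = np.linspace(0.001, 0.999, grid)
    losses = predicted_loss(N, fracs * budget, (1 - fracs) * budget)
    i = int(np.argmin(losses))
    return fracs[i], losses[i]

# Hypothetical example: a 3B-parameter model with a 10B-token budget.
frac, loss_opt = best_split(N=3e9, budget=10e9)
```

Note that under this simplified additive form the optimal split is independent of N; capturing the paper's finding that marginal utility depends on model scale would require interaction terms between N, D, and R.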
Original Abstract
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.