To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
AI Summary
Studies the trade-off between pretraining data volume and retrieval data volume, providing guidance on data allocation for RAG systems.
Key Contributions
- Proposes a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size.
- Quantifies the performance gains from retrieval and analyzes how their marginal utility depends on model scale and task type.
- Offers practical guidance on the optimal allocation between pretraining and retrieval under a fixed data budget.
Methodology
Trains OLMo-2 models at multiple scales while controlling pretraining data volume and retrieval corpus size, then evaluates the models on a suite of benchmarks.
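The fixed-budget allocation question can be sketched numerically. The sketch below assumes a Chinchilla-style additive power law; the functional form and every constant in it are illustrative assumptions, not the paper's fitted values. A grid search then recovers the loss-minimizing split of a fixed token budget between pretraining data and the retrieval store:

```python
import numpy as np

def predicted_loss(N, D, R,
                   E=1.7, a=400.0, alpha=0.34,
                   b=1e3, beta=0.28, c=50.0, gamma=0.30):
    """Hypothetical loss surface over model parameters N, pretraining
    tokens D, and retrieval-corpus tokens R. The additive power-law form
    and all constants are illustrative assumptions."""
    return E + a / N**alpha + b / D**beta + c / R**gamma

def best_split(N, budget, grid=999):
    """Grid-search the fraction of a fixed token budget spent on
    pretraining data; the remainder goes to the retrieval store."""
    fracs = np.linspace(0.001, 0.999, grid)
    losses = predicted_loss(N, fracs * budget, (1 - fracs) * budget)
    i = int(np.argmin(losses))
    return fracs[i], losses[i]

# Hypothetical example: a 3B-parameter model with a 10B-token budget.
frac, loss_opt = best_split(N=3e9, budget=10e9)
```

Note that under this simplified additive form the optimal split is independent of N; capturing the paper's finding that marginal utility depends on model scale would require interaction terms between N, D, and R.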
Original Abstract
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.