Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents
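AI Summary
The paper evaluates how different chunking strategies perform in RAG over oil and gas enterprise documents. It finds that structure-aware chunking performs better overall, but that none of the strategies handles P&IDs well.
Key Contributions
- Compares the performance of four chunking strategies on oil and gas domain documents
- Finds that structure-aware chunking has advantages in both retrieval effectiveness and computational cost
- Identifies the limitations of existing methods on image-based documents
Methodology
On a corpus of oil and gas enterprise documents, the study compares the retrieval effectiveness of fixed-size sliding window, recursive, semantic, and structure-aware chunking strategies, evaluated with top-K metrics.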
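To make the comparison concrete, here is a minimal sketch of the simplest of the four strategies, fixed-size sliding window chunking, together with one possible top-K metric (hit rate at K). The function names, the word-based window, and the default parameter values are illustrative assumptions; the summary does not state the paper's window size, overlap, or exact metric definitions.

```python
from typing import List, Sequence


def sliding_window_chunks(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def hit_rate_at_k(ranked_ids: Sequence[Sequence[str]],
                  relevant_ids: Sequence[str],
                  k: int) -> float:
    """Fraction of queries whose relevant chunk id appears in the top-k retrieved ids.

    Assumes exactly one relevant chunk id per query (a simplification).
    """
    if not ranked_ids:
        return 0.0
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k])
    return hits / len(ranked_ids)
```

The same metric function can score any chunking strategy, since it only looks at which chunk ids the retriever returned for each query.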
Original Abstract
Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P&IDs, underscoring a core limitation of purely text-based RAG within visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.
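As an illustration of what explicit structure preservation can look like, the following is a minimal sketch of structure-aware chunking that keeps each section together with its heading. It assumes the documents have already been converted to plain text with Markdown-style `#` headings; that input format and the function names are hypothetical and are not described in the abstract.

```python
from typing import Dict, List


def _append_section(chunks: List[Dict[str, str]], heading: str, body: List[str]) -> None:
    """Add one section as a chunk, skipping sections with no heading and no content."""
    content = "\n".join(body).strip()
    if heading or content:
        chunks.append({"heading": heading, "content": content})


def structure_aware_chunks(text: str) -> List[Dict[str, str]]:
    """Split text at heading lines so that each chunk is one whole section."""
    chunks: List[Dict[str, str]] = []
    heading: str = ""
    body: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):  # assumed heading marker
            _append_section(chunks, heading, body)
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    _append_section(chunks, heading, body)  # flush the final section
    return chunks
```

Unlike a fixed-size window, this never splits a section mid-procedure, and it needs no embedding calls to find boundaries, which is one plausible reason such an approach could be cheaper than semantic chunking. It does nothing for P&IDs, whose content is spatial rather than textual.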