LLM Memory & RAG relevance: 9/10

Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation

Julian Oestreich, Maximilian Bley, Frank Binder, Lydia Müller, Maksym Sydorenko, André Alcalde
arXiv: 2603.23047v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

For the electronic design automation (EDA) domain, the paper proposes a RAG fine-tuning approach, designs new evaluation metrics, and demonstrates that small models can be effectively adapted.

Key Contributions

  • Proposes TriFEX, a triple-based evaluation pipeline for assessing the quality of RAG generation
  • Proposes the PKP metric for assessing the accuracy of a RAG model's internal (parametric) knowledge
  • Shows that an existing knowledge-internalization metric is sensitive to retrieval conditions
  • Demonstrates that fine-tuned 7B models outperform a 72B baseline

Methodology

A 7B model is fine-tuned for RAG under five context augmentation strategies, and the generated outputs are evaluated with TriFEX and PKP.
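To make the PKP idea concrete, here is a minimal sketch of how Parametric Knowledge Precision could be computed from extracted claim triples, following the abstract's description (filter out claims leaked in the prompt, then measure the correctness of what remains). The function and variable names, the claim representation as tuples, and the toy data are all illustrative assumptions, not the authors' implementation.

```python
def pkp(generated_claims, prompt_claims, reference_claims):
    """Sketch of Parametric Knowledge Precision (PKP).

    generated_claims: claims (e.g. triples) extracted from the model output
    prompt_claims:    claims already present in the prompt (query + context),
                      which are filtered out as "leaked" rather than internalized
    reference_claims: claims considered factually correct
    """
    leaked = set(prompt_claims)
    # Keep only claims the model must have produced from parametric knowledge.
    internal = [c for c in generated_claims if c not in leaked]
    if not internal:
        return 0.0  # no internal knowledge expressed at all
    correct = sum(1 for c in internal if c in set(reference_claims))
    return correct / len(internal)


# Toy example with claims as (subject, relation, object) tuples:
gen = [("FPGA", "contains", "LUTs"),        # leaked from the prompt
       ("FPGA", "introduced_in", "1984"),   # internal and correct
       ("FPGA", "made_of", "cheese")]       # internal but wrong
ctx = [("FPGA", "contains", "LUTs")]
ref = [("FPGA", "introduced_in", "1984")]

print(pkp(gen, ctx, ref))  # 0.5: one of the two internal claims is correct
```

The companion rate PR mentioned in the abstract would, under the same assumptions, measure how *often* internal claims are expressed (e.g. `len(internal) / len(generated_claims)`), which is exactly the quantity PKP deliberately factors out.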

Original Abstract

Retrieval-Augmented Generation (RAG) fine-tuning has shown substantial improvements over vanilla RAG, yet most studies target document question answering and often rely on standard NLP metrics that can obscure factual differences. We evaluate RAG fine-tuning for long-form text generation in electronic design automation, adapting a 7B model under five context augmentation strategies with varying retrieval conditions. We introduce TriFEX, a human-validated, triple-based evaluation pipeline that attributes generated claims to their origin (user query, context, and reference) and propose Parametric Knowledge Precision (PKP), which isolates internalized knowledge by filtering out claims leaked in the prompt. We show that ROUGE and BERTScore fail to detect factual differences that our triple-based evaluation reveals. Additionally, we demonstrate that an existing metric for knowledge internalization is retrieval-sensitive, with about 75% of its cross-condition variance driven by changes in the rate at which internal knowledge is expressed (PR), rather than by changes in its actual correctness (PKP). The fine-tuned 7B variants outperform a 72B baseline on most metrics, further showing generalization across conditions and on a related benchmark. These results underscore the limitations of available metrics in RAG evaluation and show that smaller models could be reasonably well adapted to specialized tasks for cost-efficient, on-premises deployment.

Tags

RAG Fine-tuning, Evaluation Metrics, Electronic Design Automation

arXiv Categories

cs.CL cs.AI cs.CE