LLM Memory & RAG Relevance: 9/10

Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging

Sameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koraş, Amin Dada, Julian Friedrich, François Beaulieu, Paul Vozila, Jens Kleesiek
arXiv: 2602.04731v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

Proposes the STM framework, which efficiently improves LLM performance on biomedical retrieval tasks through synthetic data, prompt optimization, and model merging.

Key Contributions

  • Proposes the Synthesize-Train-Merge (STM) framework
  • Uses synthetic hard negatives to improve retrieval performance
  • Improves domain adaptation through model merging
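The summary does not spell out the training objective, but a common way to exploit synthetic hard negatives when finetuning a retriever is an InfoNCE-style contrastive loss: each query is scored against its positive passage and K hard negatives, and the positive must win. A minimal NumPy sketch under that assumption (the function name, shapes, and temperature are illustrative, not from the paper):

```python
import numpy as np

def info_nce_with_hard_negatives(q, pos, hard_negs, temperature=0.05):
    """InfoNCE loss over one positive and K hard negatives per query.

    q:         (B, D) query embeddings
    pos:       (B, D) positive passage embeddings
    hard_negs: (B, K, D) hard-negative passage embeddings
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, pos, hard_negs = l2norm(q), l2norm(pos), l2norm(hard_negs)

    pos_sim = np.sum(q * pos, axis=-1, keepdims=True)    # (B, 1) cosine sims
    neg_sim = np.einsum("bd,bkd->bk", q, hard_negs)      # (B, K) cosine sims
    logits = np.concatenate([pos_sim, neg_sim], axis=1) / temperature

    # Stable log-softmax; the positive sits at index 0 of each row.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()
```

Harder negatives (near-misses rather than random passages) make the denominator competitive and yield a stronger gradient signal, which is the motivation for synthesizing them.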

Methodology

The STM framework comprises synthetic hard-negative generation, retrieval prompt optimization, and model merging, improving performance without extensive pretraining.
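The merging step is not detailed in this summary; one widely used realization is weighted parameter averaging of same-architecture expert checkpoints (as in model soups). A minimal sketch under that assumption, representing each expert as a dict mapping parameter names to arrays (`merge_experts` and its signature are hypothetical):

```python
import numpy as np

def merge_experts(expert_state_dicts, weights=None):
    """Merge task-specific expert checkpoints by weighted parameter averaging.

    expert_state_dicts: list of {param_name: np.ndarray}, one per expert;
                        all experts must share the same architecture.
    weights:            optional per-expert mixing weights (defaults to uniform).
    """
    n = len(expert_state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(w * sd[name]
                           for w, sd in zip(weights, expert_state_dicts))
    return merged
```

Averaging finetuned experts of the same base model is what lets the merged retriever retain general-domain capability while absorbing each expert's specialization, without any additional pretraining.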

Original Abstract

Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5% (average 7.5%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers, preserving general-domain capabilities while excelling on specialized tasks.

Tags

RAG LLM Biomedical Retrieval

arXiv Categories

cs.CL cs.LG