LLM Memory & RAG relevance: 8/10

Diffusion-Pretrained Dense and Contextual Embeddings

Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, Denis Bykov
arXiv: 2602.11151v1 Published: 2026-02-11 Updated: 2026-02-11

AI Summary

The paper introduces pplx-embed, a family of multilingual embedding models that build on a diffusion-pretrained language model to improve retrieval performance, achieving strong results across multiple benchmarks.

Key Contributions

  • Introduces the pplx-embed model family, comprising pplx-embed-v1 and pplx-embed-context-v1
  • Uses a diffusion-pretrained language model as the backbone to strengthen contextual understanding
  • Achieves state-of-the-art or competitive results on MTEB, MIRACL, ConTEB, and other benchmarks
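The report describes multi-stage contrastive learning but does not spell out the loss; a minimal in-batch contrastive (InfoNCE-style) sketch, where all function names, shapes, and the temperature value are illustrative assumptions:

```python
import numpy as np

def info_nce(q: np.ndarray, d: np.ndarray, tau: float = 0.05) -> float:
    """In-batch contrastive loss: d[i] is the positive passage for query q[i];
    every other row in the batch serves as a negative."""
    # L2-normalize so the dot product is cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / tau                       # similarity matrix / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return float(-np.mean(log_probs[idx, idx]))  # cross-entropy on the diagonal

# Toy usage with random query/passage embeddings
rng = np.random.default_rng(0)
loss = info_nce(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```

In practice the multiple stages would vary the data mix and negatives (e.g. mined hard negatives) while keeping a loss of this shape.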

Methodology

The models are trained with multi-stage contrastive learning on a diffusion-pretrained language model, whose bidirectional attention captures full passage context; mean pooling and a late-chunking strategy are combined to handle long documents.
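The core of late chunking is to encode the whole document once with bidirectional attention and only then mean-pool token embeddings per chunk, so each chunk vector inherits global document context. A minimal sketch, with the function name, shapes, and the random "encoder output" all assumed for illustration:

```python
import numpy as np

def late_chunk(token_embs: np.ndarray, chunk_spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool token embeddings per chunk *after* encoding the full
    document, rather than encoding each chunk in isolation."""
    return np.stack([token_embs[start:end].mean(axis=0) for start, end in chunk_spans])

# Toy example: a 6-token "document" with 4-dim token embeddings,
# standing in for the output of a bidirectional encoder.
rng = np.random.default_rng(0)
token_embs = rng.normal(size=(6, 4))
chunks = late_chunk(token_embs, [(0, 3), (3, 6)])  # two 3-token chunks
print(chunks.shape)  # (2, 4)
```

Because each token embedding was computed while attending to the entire document, even the first chunk's vector can reflect information from the last chunk, which is what a per-chunk encoder would lose.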

Original Abstract

In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.

Tags

Embedding Models · Multilingual · Retrieval · Contrastive Learning · Diffusion Models

arXiv Categories

cs.LG cs.CL cs.IR