LLM Reasoning relevance: 7/10

LLM2Vec-Gen: Generative Embeddings from Large Language Models

Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy
arXiv: 2603.10913v1 Published: 2026-03-11 Updated: 2026-03-11

AI Summary

LLM2Vec-Gen proposes a novel self-supervised method that produces high-quality text embeddings by learning to represent the LLM's potential output.

Key Contributions

  • Proposes LLM2Vec-Gen, a novel self-supervised embedding method.
  • Achieves state-of-the-art self-supervised performance on MTEB.
  • Reduces harmful content retrieval and improves reasoning capabilities.

Methodology

Trainable special tokens are added to the LLM's vocabulary and optimized to represent the LLM's potential response to the input. Training is guided by the LLM's own completions together with an unsupervised embedding teacher, while the LLM backbone itself stays frozen.
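The training loop described above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the tiny frozen transformer stands in for the real LLM backbone, the random teacher vector stands in for the unsupervised embedding teacher's distillation target, and all dimensions are placeholder assumptions. Only the appended special-token embeddings receive gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_special, vocab = 64, 4, 100

# Frozen stand-in for the LLM backbone (the paper keeps the real LLM frozen).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
tok_emb = nn.Embedding(vocab, d_model)
backbone.eval()  # disable dropout for a deterministic sketch
for p in list(backbone.parameters()) + list(tok_emb.parameters()):
    p.requires_grad_(False)

# The only trainable parameters: n_special new "response" token embeddings.
special = nn.Parameter(torch.randn(n_special, d_model) * 0.02)

def embed(query_ids):
    x = tok_emb(query_ids)                               # (B, T, d)
    s = special.unsqueeze(0).expand(x.size(0), -1, -1)   # append special tokens
    h = backbone(torch.cat([x, s], dim=1))
    return h[:, -n_special:, :].mean(dim=1)              # fixed-length embedding

query = torch.randint(0, vocab, (2, 8))
teacher = F.normalize(torch.randn(2, d_model), dim=-1)   # placeholder target

opt = torch.optim.Adam([special], lr=1e-2)
losses = []
for _ in range(20):
    opt.zero_grad()
    z = F.normalize(embed(query), dim=-1)
    loss = (1 - (z * teacher).sum(dim=-1)).mean()        # cosine distillation loss
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The real method additionally uses the LLM's own completion of the query as a training signal; this sketch shows only the teacher-distillation half and the frozen-backbone / trainable-token structure.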

Original Abstract

LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

Tags

text embedding, self-supervised learning, large language model

arXiv Categories

cs.CL