PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving
AI Summary
PrefillShare significantly reduces latency and improves throughput in multi-LLM systems by sharing the prefill module across models.
Main Contributions
- Proposes the PrefillShare algorithm, which shares the prefill stage across multiple models
- Designs a vLLM-based routing mechanism for heterogeneous models
- Experiments demonstrate performance gains on multi-model agent tasks
Methodology
The model is factorized into prefill and decode modules; the prefill module is frozen and only the decode module is fine-tuned, which allows multiple task-specific models to share a single prefill module.
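The idea can be sketched as follows. This is an illustrative toy (the names and numeric "weights" are ours, not the paper's): because prefill runs with frozen base weights, every fine-tuned model produces the identical KV cache for the same prompt, so one prefill pass can serve many task-specific decode modules.

```python
BASE_W = 1.0  # frozen prefill weight, identical for every fine-tuned model

def prefill_kv(prompt):
    # KV state depends only on the frozen base weights, so it is the
    # same regardless of which task-specific model will decode.
    return [BASE_W * ord(c) for c in prompt]

class TaskModel:
    """Toy task-specific model: only the decode-side weight is fine-tuned."""
    def __init__(self, decode_w):
        self.decode_w = decode_w  # the only trainable part per task

    def decode(self, kv):
        # Decode consumes the shared KV cache with its own fine-tuned weight.
        return self.decode_w * sum(kv)

kv = prefill_kv("shared context")          # computed once for the shared prompt
summarizer = TaskModel(decode_w=0.5)
planner = TaskModel(decode_w=2.0)
outputs = (summarizer.decode(kv), planner.decode(kv))  # both reuse the same KV
```

The key property this sketch shows: the two models produce different outputs from the same KV cache, so accuracy can remain task-specific while the prefill work and KV storage are shared.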
Original Abstract
Multi-agent systems increasingly orchestrate multiple specialized language models to solve complex real-world problems, often invoking them over a shared context. This execution pattern repeatedly processes the same prompt prefix across models. Consequently, each model redundantly executes the prefill stage and maintains its own key-value (KV) cache, increasing aggregate prefill load and worsening tail latency by intensifying prefill-decode interference in existing LLM serving stacks. Disaggregated serving reduces such interference by placing prefill and decode on separate GPUs, but disaggregation does not fundamentally eliminate inter-model redundancy in computation and KV storage for the same prompt. To address this issue, we propose PrefillShare, a novel algorithm that enables sharing the prefill stage across multiple models in a disaggregated setting. PrefillShare factorizes the model into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module. This design allows multiple task-specific models to share a prefill module and the KV cache generated for the same prompt. We further introduce a routing mechanism that enables effective prefill sharing across heterogeneous models in a vLLM-based disaggregated system. PrefillShare not only matches full fine-tuning accuracy on a broad range of tasks and models, but also delivers 4.5x lower p95 latency and 3.9x higher throughput in multi-model agent workloads.
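The routing mechanism the abstract describes can be approximated with a small sketch. This is a hypothetical simplification (the class and method names are ours, not vLLM's API): a router deduplicates prefill by prompt, invokes the prefill worker at most once per shared prompt, and hands the resulting KV reference to each heterogeneous decode model.

```python
class Router:
    """Toy router: one prefill per unique prompt, fanned out to many
    decode models (illustrative only, not the paper's implementation)."""
    def __init__(self):
        self.kv_store = {}       # prompt -> KV cache reference
        self.prefill_count = 0   # how many times the prefill worker ran

    def handle(self, prompt, model_name):
        if prompt not in self.kv_store:
            # Cache miss: run the shared prefill worker once for this prompt.
            self.prefill_count += 1
            self.kv_store[prompt] = f"kv-ref-{len(self.kv_store)}"
        # Cache hit or fresh entry: the decode worker for this model
        # receives a reference to the shared KV cache.
        return (model_name, self.kv_store[prompt])

router = Router()
results = [router.handle("shared prompt prefix", m)
           for m in ("planner-llm", "coder-llm", "reviewer-llm")]
# A single prefill pass serves all three heterogeneous decode models.
```

Under this pattern, aggregate prefill load grows with the number of unique prompts rather than the number of model invocations, which is the source of the latency and throughput gains the abstract reports.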