LLM Memory & RAG relevance: 8/10

QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill

Dalton Jones, Junyoung Park, Matthew Morse, Mingu Lee, Chris Lott, Harper Langston
arXiv: 2602.08722v1 Published: 2026-02-09 Updated: 2026-02-09

AI Summary

QUOKA is a query-oriented KV selection algorithm that accelerates LLM inference by reducing the number of KV pairs used per attention evaluation while preserving accuracy.

Key Contributions

  • Proposes QUOKA, a new training-free sparse attention algorithm
  • Selects KV pairs based on each query's cosine similarity to the mean query
  • Achieves speedups across multiple hardware platforms (Nvidia GPUs and Intel Xeon CPUs)

Methodology

QUOKA first retains a small set of representative queries, then selects the keys most aligned with those queries. Queries with low cosine similarity to the mean query are prioritized, since they interact with more keys and contribute most to the final attention output, so attending to them more fully closely approximates full attention during prefill.
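The two-step selection above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `quoka_select`, the budgets, and the use of a max-over-representatives key score are assumptions for clarity; the paper only specifies that low-cosine-similarity queries are retained and that keys most aligned with them are subselected.

```python
import numpy as np

def quoka_select(Q, K, query_budget, key_budget):
    """Illustrative sketch of QUOKA-style query/key selection.

    Q: (n_q, d) queries from the current prefill chunk.
    K: (n_k, d) cached keys.
    Returns indices of retained representative queries and selected keys.
    (Hypothetical helper; not the official QUOKA code.)
    """
    # Step 1: score each query by cosine similarity to the mean query.
    # Low-similarity queries are kept, as they interact with more keys.
    q_mean = Q.mean(axis=0)
    q_unit = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    m_unit = q_mean / np.linalg.norm(q_mean)
    cos_sim = q_unit @ m_unit
    rep_idx = np.argsort(cos_sim)[:query_budget]  # lowest cosine similarity

    # Step 2: keep the keys most aligned with the representative queries
    # (here scored by the best dot-product match over representatives).
    scores = (Q[rep_idx] @ K.T).max(axis=0)
    key_idx = np.sort(np.argsort(scores)[-key_budget:])
    return rep_idx, key_idx
```

Attention for the chunk would then be evaluated only over `K[key_idx]`, which is how the KV-pair reduction (88% fewer pairs in the paper's experiments) translates into prefill speedup.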

Original Abstract

We present QUOKA: Query-oriented KV selection for efficient attention, a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While many queries focus on a smaller group of keys in the attention operator, we observe that queries with low cosine similarity with respect to the mean query interact more strongly with more keys and have the greatest contribution to final attention logits. By prioritizing these low cosine similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QUOKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselecting the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that, while realizing a 3x reduction in time-to-first-token, a 5x speedup in attention on Nvidia GPUs, and up to nearly a 7x speedup on Intel Xeon CPUs, QUOKA achieves near-baseline accuracy, utilizing 88% fewer key-value pairs per attention evaluation.

Tags

LLM Sparse Attention Inference Optimization KV Selection Acceleration

arXiv Categories

cs.LG cs.AI