LLM Memory & RAG relevance: 8/10

ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, Xiaowen Chu
arXiv: 2603.17435v1 Published: 2026-03-18 Updated: 2026-03-18

AI Summary

ZipServ accelerates LLM inference and reduces its memory footprint through hardware-aware lossless compression.

Key Contributions

  • Proposes the Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE) format
  • Designs a fused decompression-GEMM (ZipGEMM) kernel
  • Realizes a "load-compressed, compute-decompressed" execution model

Methodology

ZipServ pairs a novel fixed-length compression format with a fused decompression-GEMM kernel that decompresses and computes directly on the GPU, cutting memory traffic and eliminating intermediate buffers.
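The summary does not spell out the TCA-TBE format itself, so as a rough illustration of the general principle only, here is a minimal, hypothetical single-bitmap block codec in Python. The `BLOCK`/`CAP` sizes and the zero-value bitmap are assumptions for the sketch, not the paper's actual triple-bitmap design; the point is that fixed-length blocks let every value be decoded independently in constant time, unlike the variable-length bitstreams of traditional entropy codecs.

```python
import numpy as np

BLOCK = 16  # values per compressed block
CAP = 8     # fixed payload slots per block (padding keeps every block the same length)

def encode_block(vals):
    """Encode BLOCK values as (bitmap, fixed-size payload).
    Bit i of the bitmap marks a non-zero at position i; the non-zeros are
    packed into a payload padded to CAP slots. A block with more than CAP
    non-zeros falls back to raw storage (bitmap=None), still fixed length."""
    if np.count_nonzero(vals) > CAP:
        return None, vals.copy()
    bitmap, payload, k = 0, np.zeros(CAP, dtype=vals.dtype), 0
    for i, v in enumerate(vals):
        if v != 0:
            bitmap |= 1 << i
            payload[k] = v
            k += 1
    return bitmap, payload

def decode_block(bitmap, payload):
    """Constant-time-per-element decode: no sequential bitstream parsing."""
    if bitmap is None:
        return payload.copy()
    out, k = np.zeros(BLOCK, dtype=payload.dtype), 0
    for i in range(BLOCK):
        if bitmap >> i & 1:
            out[i] = payload[k]
            k += 1
    return out
```

Because every block occupies the same number of bytes, a GPU thread can locate and decode its block by index arithmetic alone, which is what preserves SIMT parallelism.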

Original Abstract

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22x over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.
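The "load-compressed, compute-decompressed" idea can be caricatured in plain NumPy. This is a toy sketch only: a hypothetical single-bitmap format and a Python loop stand in for the paper's TCA-TBE and its Tensor Core kernel. What it shows is the structural point that each compressed weight column is decoded immediately before its dot product, so the full dense weight matrix is never materialized in memory.

```python
import numpy as np

BLOCK = 16  # weight values per compressed block (toy size)

def decode_tile(bitmap, payload):
    """Decode one fixed-length block: bit i of the bitmap marks a
    non-zero weight at position i; payload lists the non-zeros in order."""
    out, k = np.zeros(BLOCK), 0
    for i in range(BLOCK):
        if bitmap >> i & 1:
            out[i] = payload[k]
            k += 1
    return out

def fused_decode_matmul(x, comp_cols):
    """Toy fused kernel: each compressed weight column is decoded right
    before its dot product, so no dense weight matrix is ever stored.
    comp_cols[j] is the list of (bitmap, payload) blocks for column j."""
    m, k = x.shape
    out = np.zeros((m, len(comp_cols)))
    for j, blocks in enumerate(comp_cols):
        col = np.concatenate([decode_tile(b, p) for b, p in blocks])[:k]
        out[:, j] = x @ col
    return out
```

In the real system this per-tile decode happens into Tensor Core registers inside the GEMM, which is what eliminates the intermediate buffers and redundant memory traffic of decoupled decompress-then-multiply pipelines.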

Tags

LLM inference, Lossless compression, GPU optimization, Tensor Core

arXiv Categories

cs.DC cs.AR cs.LG cs.PF