ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
AI Summary
ZipServ accelerates LLM inference and reduces its memory footprint through hardware-aware lossless compression.
Main Contributions
- Proposes the Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE) format
- Designs a fused decompression-GEMM (ZipGEMM) kernel
- Implements a "load-compressed, compute-decompressed" execution pattern
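The summary does not spell out the actual TCA-TBE layout, so the sketch below only illustrates the general idea behind bitmap-based encodings of this kind: a bitmap flags which bytes equal a frequent "common" value, and only the remaining bytes are stored as literals. Because the bitmap has fixed width, a parallel decoder can locate its data with a population count instead of scanning a variable-length bitstream. The block size, the single-bitmap scheme, and all function names here are illustrative assumptions, not the paper's format.

```python
import numpy as np

# Hypothetical single-bitmap block codec (NOT the real triple-bitmap TCA-TBE
# layout, which this summary does not describe). Bit set -> value equals the
# chosen `common` byte; bit clear -> value stored verbatim as a literal.

BLOCK = 32  # assumed block size

def encode_block(block, common):
    """block: BLOCK uint8 values; returns (bitmap, literal bytes)."""
    bitmap = 0
    literals = []
    for i, b in enumerate(block):
        if b == common:
            bitmap |= 1 << i          # covered by the bitmap, no byte stored
        else:
            literals.append(int(b))   # stored verbatim
    return bitmap, bytes(literals)

def decode_block(bitmap, literals, common):
    """Rebuild the block; a literal's offset is the popcount of clear bits
    before its position, so decoding needs no sequential bitstream scan."""
    out, j = [], 0
    for i in range(BLOCK):
        if (bitmap >> i) & 1:
            out.append(common)
        else:
            out.append(literals[j]); j += 1
    return np.array(out, dtype=np.uint8)

# Roundtrip check on skewed data (common byte dominates, as in weight planes).
rng = np.random.default_rng(0)
data = rng.choice([0x80, 0x81, 0x7F], size=BLOCK, p=[0.7, 0.2, 0.1]).astype(np.uint8)
bm, lits = encode_block(data, common=0x80)
assert np.array_equal(decode_block(bm, lits, common=0x80), data)
```

The fixed-width metadata is what makes decoding constant-time and SIMT-friendly, in contrast to the variable-length bitstreams of traditional entropy codecs that the abstract criticizes.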
Methodology
ZipServ pairs a novel fixed-length compression format with a fused decompression-GEMM kernel, decompressing and computing directly on the GPU to cut memory traffic and eliminate intermediate buffers.
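The "load-compressed, compute-decompressed" pattern can be mimicked structurally on the CPU. The real ZipGEMM decodes TCA-TBE weights into Tensor Core registers inside a CUDA kernel; in this hedged sketch, zlib stands in for the codec and a Python loop for the kernel, and the tile size and function names are assumptions. The point is the structure: weights stay compressed in memory, and each K-tile is decompressed only transiently, right before its partial product is accumulated, so no full decompressed weight matrix is ever materialized.

```python
import zlib
import numpy as np

TILE = 16  # assumed tile size along the K (reduction) dimension

def compress_weights(w):
    """Split (K, N) weights into K-tiles and compress each independently,
    so tiles can be decoded on demand during the matmul."""
    return [(zlib.compress(w[k:k + TILE].tobytes()), w[k:k + TILE].shape)
            for k in range(0, w.shape[0], TILE)]

def fused_matmul(x, tiles, dtype=np.float32):
    """x: (M, K). Decompress each weight tile on the fly and accumulate its
    partial product; the decompressed tile lives only for one iteration."""
    out = None
    for t, (blob, shape) in enumerate(tiles):
        w_tile = np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)
        part = x[:, t * TILE:t * TILE + shape[0]] @ w_tile
        out = part if out is None else out + part
    return out

# Sanity check against an ordinary dense matmul.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 32), dtype=np.float32)
w = rng.standard_normal((32, 8), dtype=np.float32)
assert np.allclose(fused_matmul(x, compress_weights(w)), x @ w, atol=1e-5)
```

On a GPU the same fusion additionally raises compute intensity: the kernel reads fewer bytes from HBM per multiply-accumulate because only the compressed form crosses the memory bus.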
Original Abstract
Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21x kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22x over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.