LLM Reasoning relevance: 8/10

SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei
arXiv: 2603.05232v1 · Published: 2026-03-05 · Updated: 2026-03-05

AI Summary

SlideSparse unlocks Sparse Tensor Core acceleration, speeding up LLM inference under (2N-2):2N sparsity patterns.

Key Contributions

  • Proposes SlideSparse, the first system to accelerate (2N-2):2N sparsity patterns on commodity GPUs.
  • Uses Sliding Window Decomposition to reconstruct (2N-2):2N weight blocks into overlapping 2:4-compliant windows.
  • Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost.

Methodology

Sparse weights are reconstructed via Sliding Window Decomposition, and Activation Lifting fuses the corresponding activation rearrangement into per-token quantization, unlocking Sparse Tensor Core acceleration.
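The decomposition step can be illustrated with a small sketch for N=4 (the 6:8 pattern): a block of 8 weights with at most 6 nonzeros is split across N-1 = 3 overlapping windows covering columns [0:4], [2:6], and [4:8], each holding at most 2 nonzeros and hence 2:4-compliant. The greedy earliest-window assignment below is a hypothetical illustration of the principle, not the paper's actual kernel logic:

```python
import numpy as np

def decompose_6_8(block):
    """Split one 6:8-sparse block (8 values, >= 2 zeros) into
    3 overlapping 2:4-compliant windows.

    Windows cover columns [0:4], [2:6], [4:8]. Every nonzero is
    assigned to exactly one window, so summing the windows' partial
    products reproduces the dense result exactly (lossless).
    """
    block = np.asarray(block, dtype=float)
    assert block.shape == (8,)
    nz = [i for i in range(8) if block[i] != 0]
    assert len(nz) <= 6, "block must satisfy 6:8 sparsity"

    starts = (0, 2, 4)                        # window start columns
    windows = [np.zeros(8) for _ in starts]   # full-width for clarity
    counts = [0, 0, 0]
    for i in nz:
        # place each nonzero in the earliest window that covers
        # column i and still has fewer than 2 nonzeros
        for w, s in enumerate(starts):
            if s <= i < s + 4 and counts[w] < 2:
                windows[w][i] = block[i]
                counts[w] += 1
                break
        else:
            raise ValueError("no 2:4-compliant assignment found")
    return windows

# Example: one 6:8 block (two zeros among eight weights)
blk = np.array([1., 2., 0., 3., 4., 0., 5., 6.])
wins = decompose_6_8(blk)
# the three windows sum back to the original block
assert np.allclose(wins[0] + wins[1] + wins[2], blk)
```

Left-to-right, earliest-fit assignment always succeeds here: columns 0-1 can only go to the first window and columns 6-7 only to the last, and the three windows' combined capacity (3 × 2) matches the 6 nonzeros, so interval-greedy placement never dead-ends.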

Original Abstract

NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning -- a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX-spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup ratio (1.33x) approaches the theoretical upper-bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code available at https://github.com/bcacdwk/vllmbench.
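The $N/(N-1)$ upper bound quoted in the abstract follows from a simple cost count. This back-of-envelope derivation is our reading of the stated bound, not taken verbatim from the paper:

```latex
% Dense execution: one MAC per column of the 2N-wide block
C_{\text{dense}} \propto 2N
% SlideSparse: N-1 windows of width 4, each at 2x Sparse Tensor Core throughput
C_{\text{sparse}} \propto (N-1)\cdot\tfrac{4}{2} = 2(N-1)
% Speedup bound
\frac{C_{\text{dense}}}{C_{\text{sparse}}}
  = \frac{2N}{2(N-1)} = \frac{N}{N-1}
  \;\xrightarrow{\;N=4\;}\; \frac{4}{3}\approx 1.33\times
```

For 6:8 sparsity (N=4) this gives 4/3, which matches the 1.33x measured on compute-bound workloads with Qwen2.5-7B.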

Tags

Sparsity · LLM acceleration · GPU · vLLM · Quantization

arXiv Category

cs.LG