Agent Tuning & Optimization · Relevance: 7/10

Sparser, Faster, Lighter Transformer Language Models

Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones
arXiv: 2603.23198v1 · Published: 2026-03-24 · Updated: 2026-03-24

AI Summary

This paper improves the inference and training efficiency of Transformer language models by introducing unstructured sparsity and optimized CUDA kernels.

Key Contributions

  • Introduces a new sparse packing format and a set of CUDA kernels
  • Demonstrates that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance
  • Validates that sparsity yields throughput, energy-efficiency, and memory-usage benefits that grow with model scale

Methodology

Model sparsity is induced via L1 regularization, efficient CUDA kernels are designed around the new sparse packing format, and the gains are validated through experiments on LLMs.
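To make the L1 mechanism concrete, here is a minimal sketch of L1-regularized training on a toy least-squares model. The paper applies L1 regularization to an LLM's feedforward weights; the tiny problem, the penalty strength `LAM`, and the learning rate below are illustrative assumptions, not values from the paper. The key behavior it shows is that the L1 penalty, applied via soft-thresholding, drives weights that contribute little to the loss exactly to zero.

```python
LAM = 0.1   # L1 penalty strength (hypothetical choice)
LR = 0.05   # learning rate (hypothetical choice)
W = [0.9, -0.8, 0.05, -0.02]          # toy weight vector
X = [[1, 0, 0, 0], [0, 1, 0, 0]]      # toy inputs
Y = [1.0, -1.0]                        # toy targets

def sign(v):
    return (v > 0) - (v < 0)

for _ in range(200):
    # Gradient of the data term 0.5 * sum((x.W - y)^2)
    grad = [0.0] * len(W)
    for x, y in zip(X, Y):
        err = sum(wi * xi for wi, xi in zip(W, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    # Gradient step on the data term, then the L1 penalty applied
    # as soft-thresholding: small weights are pulled exactly to zero.
    for j in range(len(W)):
        w = W[j] - LR * grad[j]
        W[j] = sign(w) * max(abs(w) - LR * LAM, 0.0)

sparsity = sum(w == 0.0 for w in W) / len(W)
print(W, sparsity)
```

After training, the two weights the toy data never exercises (`W[2]`, `W[3]`) are exactly zero, while the useful weights settle near their (slightly shrunken) least-squares values, giving 50% sparsity on this toy problem.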

Original Abstract

Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
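The summary does not describe the paper's sparse packing format itself, so as a hedged illustration of the general idea — storing only the nonzero weights of a sparse feedforward matrix and skipping zero multiplies — here is a standard CSR (compressed sparse row) layout in plain Python. The actual format and kernels in the paper are GPU-oriented and will differ.

```python
def csr_pack(dense):
    """Pack a dense row-major matrix into (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x touching only the stored nonzeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A 3x3 matrix at ~67% sparsity: only 3 of 9 entries are stored.
W = [[0.0, 2.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 3.0]]
vals, cols, ptr = csr_pack(W)
y = csr_matvec(vals, cols, ptr, [1.0, 2.0, 3.0])
print(vals, cols, ptr)  # [2.0, 1.0, 3.0] [1, 0, 2] [0, 1, 2, 3]
print(y)                # [4.0, 1.0, 9.0]
```

At the 99%+ sparsity levels the paper reports, such a format stores roughly 1% of the dense parameter count, which is where the memory and throughput benefits come from; the paper's contribution is making this pay off on real GPU execution pipelines.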

Tags

Transformer · Sparsity · CUDA · LLM · Optimization

arXiv Categories

cs.LG cs.CL